1 Introduction


1.1 Programming Model 


As with most computing systems, the Intel® Many Integrated Core (Intel® MIC) Architecture programming model can be
divided into two categories: application programming and system programming.


1.1.1 Application Programming 


In this guide, application programming refers to developing user applications or codes using either the Intel® Composer
XE 2013 or third-party software development tools. These tools typically contain a development environment that includes
compilers, libraries, and assorted other tools.


Application programming is not covered here; consult the Intel® Xeon Phi™ Coprocessor DEVELOPER'S QUICK
START GUIDE for information on how to quickly write application code and run applications on a development platform
that includes the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). That guide also describes the available
tools and gives some simple examples to show how to get C/C++ and Fortran-based programs up and running.


The development environment includes the following compilers and libraries, which are available at 
https://registrationcenter.intel.com: 


• Intel® C/C++ Compiler XE 2013 including Intel® MIC Architecture for building applications that run on Intel® 64 and
  Intel® MIC Architectures

• Intel® Fortran Compiler XE 2013 including Intel® MIC Architecture for building applications that run on Intel® 64 and
  Intel® MIC Architectures


Libraries for use with the offload compiler include: 


• Intel® Math Kernel Library (Intel® MKL) optimized for Intel® MIC Architecture
• Intel® Threading Building Blocks


The development environment includes the following tools: 


• Debugger
  ▪ Intel® Debugger for applications including Intel® MIC Architecture
  ▪ Intel® Debugger for applications running on Intel® Architecture (IA)
• Profiling
  ▪ SEP enables performance data collection from the Intel® Xeon Phi™ coprocessor. This feature is included as
    part of the VTune™ Amplifier XE 2013 tool.
  ▪ Performance data can be analyzed using VTune™ Amplifier XE 2013


1.1.2 System Programming


System programming here covers how to use the Intel® MIC Architecture, its low-level APIs (e.g., SCIF), and the
contents of the Intel® Many Integrated Core Architecture Platform Software Stack (MPSS). Detailed information on these
low-level APIs can be found in Section 5 of this document.
1.2 Section Overview


The information in this guide is organized as follows: 


Section 2 contains a high-level description of the Intel® Xeon Phi™ coprocessor hardware and software
architecture.

Section 3 covers power management from the software perspective. It also
covers virtualization support in the Intel® Xeon Phi™ coprocessor and some Reliability, Availability, and
Serviceability (RAS) features such as BLCR* and MCA.

Section 4 covers Operating System support.

Section 5 covers the low-level APIs (e.g., SCIF) available with the Intel® Xeon Phi™ coprocessor software stack.

Section 6 illustrates the usage models and the various operating modes for platforms with the Intel® Xeon Phi™
coprocessors in the compute continuum.

Section 7 provides in-depth details of the Intel® Xeon Phi™ coprocessor Vector Processing Unit architecture.

A glossary of terms and abbreviations can be found in Section 8.

References are collated in Section 9.


1.3 Related Technologies and Documents 


This section lists related documentation that you might find useful for information not covered here.


Industry specifications for standards (e.g., OpenMP*, OpenCL*, MPI, OFED*, and POSIX* threads) are not covered in this
document. For this information, consult the relevant specifications published by their respective owning organizations:


Table 1-1. Related Industry Standards


Technology        Location
OpenMP*           http://openmp.or
OpenCL*           http://www.khronos.org/opencl
MPI               http://www.mpi-forum.or
OFED* Overview    http://www.openfabrics.or


You should also consult relevant published documents that cover the Intel® software development tools not covered
here:


Table 1-2. Related Documents


Document                                                          Location
Intel® Xeon Phi™ Coprocessor DEVELOPER'S QUICK START GUIDE        http://software.intel.com/en-us/mic-developer
Intel® Many Integrated Core Platform Software Stack               http://software.intel.com/en-us/mic-developer
Intel® Xeon Phi™ Coprocessor Instruction Set Architecture         http://software.intel.com/en-us/mic-developer
  Reference Manual
An Overview of Programming for Intel® Xeon® processors            http://software.intel.com/en-us/mic-developer
  and Intel® Xeon Phi™ coprocessors
Debugging Intel® Xeon Phi™ Coprocessor: Command-Line              http://software.intel.com/en-us/mic-developer
  Debugging
Building Native Applications for Intel® Xeon Phi™                 http://software.intel.com/en-us/mic-developer
  Coprocessor
Programming and Compiling for Intel® Many Integrated              http://software.intel.com/en-us/mic-developer
  Core Architecture
Intel® Xeon Phi™ coprocessor Micro-architecture Software          http://software.intel.com/en-us/mic-developer
  Stack
Intel® Xeon Phi™ coprocessor Micro-architecture Overview          http://software.intel.com/en-us/mic-developer
Intel® MPI Library                                                http://www.intel.com/go/mpi
Intel® MIC SCIF API Reference Manual for Kernel Mode Linux*       http://intel.com/software/mic
Intel® MIC SCIF API Reference Manual for User Mode Linux*         http://intel.com/software/mic
2 Intel® Xeon Phi™ Coprocessor Architecture


This section explains both the hardware and the software architecture of the Intel® Xeon Phi™ coprocessor. It covers the
major micro-architectural features such as the core, the vector processing unit (VPU), the high-performance on-die
bidirectional interconnect, fully coherent L2 caches, and how the various units interact. Particular emphasis is placed on
the key parameters necessary to understand program optimization, such as cache organization and memory bandwidth.


2.1 Intel® Xeon Phi™ Coprocessor Architecture


The Intel® Xeon Phi™ coprocessor comprises up to sixty-one (61) processor cores connected by a high-performance
on-die bidirectional interconnect. In addition to the IA cores, there are 8 memory controllers supporting up to 16 GDDR5
channels delivering up to 5.5 GT/s, and special function devices such as the PCI Express* system interface.


Each core is a fully functional, in-order core that fetches and decodes instructions from four hardware thread
execution contexts. To reduce hot-spot contention for data among the cores, a distributed tag directory is
implemented so that every physical address the coprocessor can reach is uniquely mapped through a reversible
one-to-one address hashing function. This hashing function not only maps each physical address to a tag directory, but also
provides a framework for more elaborate coherence protocol mechanisms than the individual cores could provide.


Each memory controller is based on the GDDR5 specification and supports two channels. At up
to 5.5 GT/s transfer speed, this provides a theoretical aggregate bandwidth of 352 GB/s (gigabytes per second) directly
connected to the Intel® Xeon Phi™ coprocessor.
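
The 352 GB/s figure can be checked with a short calculation. The sketch below assumes 8 memory controllers with two channels each, a 32-bit channel width (as stated for the memory controllers in this section), and one transfer per channel per transfer cycle; the function name is hypothetical and used only for illustration.

```python
def gddr5_peak_bandwidth_gb_s(controllers=8, channels_per_controller=2,
                              channel_width_bits=32, transfer_rate_gt_s=5.5):
    """Theoretical aggregate bandwidth in GB/s (10^9 bytes per second)."""
    channels = controllers * channels_per_controller   # 16 channels in total
    bytes_per_transfer = channel_width_bits // 8        # 4 bytes per 32-bit channel
    return channels * bytes_per_transfer * transfer_rate_gt_s


if __name__ == "__main__":
    print(gddr5_peak_bandwidth_gb_s())
```

This is a peak value; sustained bandwidth seen by software will be lower.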


At a high level, Intel® Xeon Phi™ coprocessor silicon consists of up to 61 dual-issue in-order cores, where each core
includes:


• 512-bit wide vector processor unit (VPU)
• The Core Ring Interface (CRI)
• Interfaces to the Core and the Ring Interconnect
• The L2 Cache (including the tag, state, data and LRU arrays) and the L2 pipeline and associated arbitration logic
• The Tag Directory (TD), which is a portion of the distributed duplicate tag directory infrastructure
• Asynchronous Processor Interrupt Controller (APIC), which receives interrupts (IPIs, or externally generated) and
  must redirect the core to respond in a timely manner.
  ▪ Memory controllers (GBOX), which access external memory devices (local physical memory on the coprocessor
    card) to read and write data. Each memory controller has 2 channel controllers, which together can operate
    two 32-bit memory channels.
  ▪ A Gen2 PCI Express* client logic (SBOX), which is the system interface to the host CPU or PCI Express* switch,
    supporting x8 and x16 configurations.
  ▪ The Ring Interconnect connecting all of the aforementioned components together on the chip.
Figure 2-1. Basic building blocks of the Intel® Xeon Phi™ Coprocessor

Table 2-1 gives a high-level description of each component.

Table 2-1. Description of Coprocessor Components


Name      Description


Core      The processor core. It fetches and decodes instructions from four hardware thread
          execution contexts. It supports a 32-bit and 64-bit execution environment similar to
          those found in the Intel64® Intel® Architecture Software Developer's Manual, along
          with the Intel Initial Many Core Instructions. It contains a 32 KB, 8-way set associative
          L1 Icache and Dcache, and interfaces with the CRI/L2 block to request access to
          memory. The core can execute 2 instructions per clock cycle, one on the U-pipe, and
          one on the V-pipe. The V-pipe cannot execute all instruction types, and simultaneous
          execution is governed by pairing rules. The core does not support Intel® Streaming
          SIMD Extensions (Intel® SSE) or MMX™ instruction execution.

VPU       The Vector Processor Unit includes the EMU (extended math unit) and executes 16
          single-precision floating-point or 16 32-bit integer operations per clock cycle, or 8
          double-precision floating-point operations per cycle. Each operation can be a floating-point
          multiply-add, giving 32 single-precision floating-point operations per cycle. The VPU
          contains the vector register file (32 registers per thread context), and can read one of
          its operands directly from memory, including data format conversion on the fly.
          Broadcast and swizzle instructions are also available. The EMU can perform base-2
          exponential, base-2 logarithm, reciprocal, and reciprocal square root of single-precision
          floating-point values.

L2/CRI    The Core-Ring Interface hosts the 512 KB, 8-way, L2 cache and connects each core to
          an Intel® Xeon Phi™ coprocessor Ring Stop. Primarily, it comprises the core-private L2
          cache itself plus all of the off-core transaction tracking queues and transaction/data
          routing logic. Two other major blocks also live in the CRI: the R-Unit (APIC) and the Tag
          Directory (TD).

TD        Distributed duplicate tag directory for cross-snooping L2 caches in all cores. The CPU
          L2 caches are kept fully coherent with each other by the TDs, which are referenced
          after an L2 cache miss. A TD tag contains the address, state, and an ID for the owner
          (one of the L2 caches) of the cache line. The TD that is referenced is not necessarily the
          one co-located with the core that generated the miss, but is based upon address (each
          TD gets an equal portion of the address space). A request is sent from the core that
          suffered the memory miss to the correct TD via the ring interconnect.

GBOX      The Intel® Xeon Phi™ coprocessor memory controller comprises three main units: the
          FBOX (interface to the ring interconnect), the MBOX (request scheduler) and the PBOX
          (physical layer that interfaces with the GDDR devices). The MBOX comprises two CMCs
          (or Channel Memory Controllers) that are completely independent from each other.
          The MBOX provides the connection between agents in the system and the DRAM I/O
          block. It is connected to the PBOX and to the FBOX. Each CMC operates independently
          from the other CMCs in the system.

SBOX      PCI Express* client logic: DMA engine, limited power management capabilities.

Ring      Ring Interconnect, including component interfaces, ring stops, ring turns, addressing,
          and flow control. The Intel® Xeon Phi™ coprocessor has 2 of each of these rings, one
          travelling in each direction. There is no queuing on the ring or in the ring turns; once a
          message is on the ring it will continue deterministically to its destination. In some
          cases, the destination does not have room to accept the message and may leave it on
          the ring and pick it up the next time it goes by. This is known as bouncing.

PBOX      The PBOX is the analog interface component of the GBOX that communicates with the
          GDDR memory device. Besides the analog blocks, the PBOX contains the input/output
          FIFO buffers, part of the training state machines and mode registers to trim the analog
          interface. The analog interface consists of the actual I/O pads for DQs, Address and
          Command and the clocking structure. The PBOX also includes the GPLL, which defines
          the clock domain for each PBOX and the respective MBOX/CBOX.

PMU       Performance Monitoring Unit. This performance monitoring feature allows data to be
          collected from all units in the architecture, utilizing a P6-style programming interface
          to configure and access performance counters. Implements an Intel® Xeon Phi™
          coprocessor SPFLT, which allows user-level code to filter the core events that its
          thread generates. Does not implement some advanced features found in mainline IA
          cores (e.g., precise event-based sampling).

Clock     The clock generation on the Intel® Xeon Phi™ coprocessor supplies clocks to each of the
          four main clock domains. The core domain supports from 600 MHz to the part's
          maximum frequency in steps of 25 MHz. Ratio changes in the core happen seamlessly
          and can be controlled through both software and internal hardware (using information
          from the thermal and current sensors on the card). The GDDR supports frequencies
          that enable between 2.8 GT/s and the part's maximum frequency with a minimum step
          size of 50 MT/s. Intel® Xeon Phi™ coprocessors support frequency changes without
          requiring a reset. PCI Express* clock modes support both Gen1 and Gen2 operation.
          The external clock buffer has been incorporated into the Intel® Xeon Phi™ coprocessor
          die, and the clocks are sourced from two 100 MHz PCI Express* reference clocks.


2.1.1 Core 


Each in-order execution core provides a 64-bit execution environment similar to that found in the Intel64® Intel®
Architecture Software Developer's Guide, in addition to introducing support for Intel Initial Many Core Instructions.
There is no support for MMX™ instructions, Intel® Advanced Vector Extensions (Intel® AVX), or any of the Intel®
Streaming SIMD Extensions (Intel® SSE). A full list of the instructions supported by the Intel® Xeon Phi™ coprocessor can
be found in the Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual
(Reference Number: 327364). New vector instructions provided by the Intel® Xeon Phi™ Coprocessor Instruction Set
use a dedicated 512-bit wide vector floating-point unit (VPU) that is provided for each of the cores.


Each core is connected to a Ring Interconnect via the Core Ring Interface (CRI), which comprises the L2 cache
control and the Tag Directory (TD). The Tag Directory contains the tags for a portion of the overall L2 cache. The Core
and L2 Slices are interconnected on a ring-based interconnect along with additional ring agents on the die. Each agent on
the ring, whether a core/L2 Slice, memory controller, or the system (SBOX), implements a ring stop that enables
requests and responses to be sent on the ring bus.


The core can execute 2 instructions per clock cycle, one on the U-pipe and one on the V-pipe. The V-pipe cannot execute 
all instruction types, and simultaneous execution is governed by pairing rules. Vector instructions can only be executed 
on the U-pipe. 
Figure 2-2: Core Pipeline Components


Figure 2-3: Intel® Xeon Phi™ Coprocessor Core Architecture 
Most integer and mask instructions have a 1-clock latency, while most vector instructions have a 4-clock latency with a
1-clock throughput. Dependent store-to-load latency is 4 clocks for simple vector operations. "Shuffles" and "Swizzles"
increase this latency. The store-to-load penalty for the L1 is approximately 12 clocks. Kunit (data cache) bounces cause 2
dead clocks (bank conflicts, U-pipe/V-pipe conflicts with higher-priority replacements, invalidations). Prefix decodes fall
into two groups: 0-cycle "fast" prefixes (62, c4, c5, REX, 0f) and 2-cycle "slow" prefixes (operand size 66, address size 67,
lock, segment, REP).


2.1.2 Instruction Decoder 


One of the changes made to simplify the core was to modify the instruction decoder to be a two-cycle unit. While fully
pipelined, the result of this change is that the core cannot issue instructions from the same hardware context in
back-to-back cycles. That is, if in cycle N the core issued instructions from context 1, then in cycle N+1 the core can issue
instructions from any context except context 1. This allows for a significant increase in the maximum core frequency,
resulting in a net performance gain even for single-threaded SPEC* benchmarks.


For maximum chip utilization, at least two hardware contexts or threads must be run on each core. Since the scheduler
cannot issue instructions in back-to-back cycles from the same hardware context, running one thread on a core will
result in, at best, 50% utilization of the core potential.
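
The 50% ceiling follows directly from the back-to-back issue restriction. The sketch below is a toy model, not a description of the actual scheduler: it assumes every active context always has instructions ready (no cache misses or other stalls) and simply counts the cycles in which some context is allowed to issue. The function name is hypothetical.

```python
def core_utilization(active_threads, cycles=1000):
    """Fraction of cycles with an issue, given that a context may not issue
    in two consecutive cycles. Assumes all contexts are always ready."""
    if active_threads < 1 or cycles < 1:
        return 0.0
    last = None      # context that issued in the previous cycle, if any
    issued = 0
    for _ in range(cycles):
        start = -1 if last is None else last
        chosen = None
        # Walk the contexts in rotation and take the first one that did
        # not issue in the previous cycle.
        for step in range(1, active_threads + 1):
            candidate = (start + step) % active_threads
            if candidate != last:
                chosen = candidate
                break
        if chosen is not None:
            issued += 1
        last = chosen  # None means the issue slot was lost this cycle
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, core_utilization(n))
```

Under these assumptions one thread fills every other issue slot, while two or more threads can fill every slot.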


2.1.3 Cache Organization and Hierarchy 


The Level One (L1) cache accommodates higher working set requirements for four hardware contexts per core. It has a 
32 KB L1 instruction cache and 32 KB L1 data cache. Associativity was increased to 8-way, with a 64 byte cache line. The 
bank width is 8 bytes. Data return can now be out-of-order. The L1 cache has a load-to-use latency of 1 cycle -- an 
integer value loaded from the cache can be used in the next clock by an integer instruction. Note, however, that vector 
instructions experience different latencies than integer instructions. The L1 cache has an address generation interlock 
with at least a 3-clock cycle latency. A GPR register must be produced three or more clocks prior to being used as a base 
or index register in an address computation. The register set-up time for base and index has the same 3-clock cycle 
latency. 


Another new feature is the 512 KB unified Level Two (L2) cache unit. The L2 organization comprises 64 bytes per way
with 8-way associativity, 1024 sets, 2 banks, 32 GB (35 bits) of cacheable address range and a raw latency of 11 clocks.
The expected idle access time is approximately 80 cycles. The L2 cache has a streaming hardware prefetcher that can
selectively prefetch code, read, and RFO (Read-For-Ownership) cache lines into the L2 cache. There are 16 streams that
can bring in up to a 4-KB page of data. Once a stream direction is detected, the prefetcher can issue up to 4
prefetch requests. The L2 in the Intel® Xeon Phi™ coprocessor supports ECC, and power states such as the core C1 (shuts off
clocks to the core and the VPU), C6 (shuts off clocks and power to the core and the VPU), and the package C3 states. The
replacement algorithm for both the L1 and L2 caches is based on a pseudo-LRU implementation.
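
The cache sizes, line size, and associativity quoted above determine the number of sets. The sketch below derives the set count and shows how an address could be split into tag, set index, and line offset. It assumes plain power-of-two indexing on the low address bits, which is an illustrative simplification; the guide does not describe the actual index or bank hashing. All function names are hypothetical.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 of the line size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, line offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


if __name__ == "__main__":
    KB = 1024
    print("L1:", cache_geometry(32 * KB, 64, 8))
    print("L2:", cache_geometry(512 * KB, 64, 8))
    print(split_address(0x12345, 6, 10))
```

For the L2 parameters this gives 1024 sets, matching the figure quoted in the text.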


The L2 cache is part of the Core-Ring Interface block. This block also houses the tag directory (TD) and the Ring Stop (RS), 
which connects to the interprocessor core network. Within these sub-blocks is the Transaction Protocol Engine which is 
an interface to the RS and is equivalent to a front side bus unit. The RS handles all traffic coming on and off the ring. The 
TDs, which are physically distributed, filter and forward requests to appropriate agents on the ring. They are also 
responsible for initiating communications with the GDDR5 memory via the on-die memory controllers. 


In the in-order Intel® Pentium® processor design, any miss to the cache hierarchy would be a core-stalling event such
that the program would not continue executing until the missing data were fetched and ready for processing. In the
Intel® Xeon Phi™ coprocessor cores, a miss in the L1 or L2 cache does not stall the entire core. Misses to the cache will
not stall the requesting hardware context of a core unless it is a load miss. Upon encountering a load miss, the hardware
context with the instruction triggering the miss will be suspended until the data are brought into the cache for
processing. This allows the other hardware contexts in the core to continue execution. Both the L1 and L2 caches can
also support up to about 38 outstanding requests per core (combined read and write). The system agent (containing the
PCI Express* agent and the DMA controller) can also generate 128 outstanding requests (read and write) for a total of
38*(number of cores) + 128. This allows software to prefetch data aggressively and avoids triggering a dependent stall
condition in the cache. When all possible access routes to the cache are in use, new requests may cause a core stall until
a slot becomes available.
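
As a quick worked instance of the formula above, the sketch below evaluates 38*(number of cores) + 128 for a given core count. The per-core figure is described in the text as approximate, so the result should be read as an estimate; the function name is hypothetical.

```python
def max_outstanding_requests(cores, per_core=38, system_agent=128):
    """Approximate chip-wide limit on outstanding read/write requests."""
    return per_core * cores + system_agent


if __name__ == "__main__":
    print(max_outstanding_requests(61))
```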


Both the L1 and L2 caches use the standard MESI protocol for maintaining the shared state among cores. The normal 
MESI state diagram is shown in Figure 2-4 and the cache states are listed in Table 2-2. L2 Cache States. 
Figure 2-4: MESI Protocol


Table 2-2. L2 Cache States 


L2 Cache State | State Definition 
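
For readers unfamiliar with MESI, the sketch below encodes the textbook state transitions for a single cache line as seen by one cache. It is a generic illustration of the standard protocol, not the coprocessor's exact implementation; the event names and the `others_have_copy` flag are hypothetical labels chosen for this example.

```python
STATES = ("M", "E", "S", "I")  # Modified, Exclusive, Shared, Invalid


def mesi_next_state(state, event, others_have_copy=False):
    """Textbook MESI transition for one line in one cache.

    event is one of: local_read, local_write, remote_read, remote_write.
    others_have_copy only matters for a local read of an Invalid line.
    """
    if state not in STATES:
        raise ValueError("unknown state: %r" % (state,))
    if event == "local_read":
        if state == "I":
            # A read miss fills as Shared if another cache holds the line,
            # otherwise as Exclusive.
            return "S" if others_have_copy else "E"
        return state
    if event == "local_write":
        # Any local write ends in Modified (other copies are invalidated).
        return "M"
    if event == "remote_read":
        # Another cache reads the line: a valid copy is downgraded to Shared
        # (a Modified line is written back first).
        return "I" if state == "I" else "S"
    if event == "remote_write":
        # Another cache takes ownership: the local copy is invalidated.
        return "I"
    raise ValueError("unknown event: %r" % (event,))


if __name__ == "__main__":
    print(mesi_next_state("I", "local_read"))
    print(mesi_next_state("M", "remote_read"))
```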


To address potential performance limitations resulting from the lack of an O (Owner) state found in the MOESI protocol,
the Intel® Xeon Phi™ coprocessor coherence system has an ownership tag directory (TD) similar to that implemented in
many multiprocessor systems. The tag directory implements the GOLS protocol. By supplementing the individual core
MESI protocols with the TD's GOLS protocol, it becomes possible to emulate the missing O-state and to achieve the
benefits of the full MOESI protocol without the cost of redesigning the local cache blocks. The TD is also useful for
controlling other behaviors in the Intel® Xeon Phi™ coprocessor design and is used for more than this emulation
behavior. The modified coherence diagrams for the core MESI protocol and the tag directory GOLS protocol are shown
in Figure 2-5.

Figure 2-5. Globally Owned Locally Shared (GOLS) Diagram


Table 2-3. Tag Directory States


Tag Directory State   State Definition

GOLS                  Globally Owned, Locally Shared. Cacheline is present in
                      one or more cores, but is not consistent with memory.

GS                    Globally Shared. Cacheline is present in one or more
                      cores and consistent with memory.

GE/GM                 Globally Exclusive/Modified. Cacheline is owned by one
                      and only one core and may or may not be consistent
                      with memory. The Tag Directory does not know
                      whether the core has actually modified the line.

GI                    Globally Invalid. Cacheline is not present in any core.
The tag directory is not centralized but is broken up into 64 distributed tag directories (DTDs). Each DTD is responsible
for maintaining the global coherence state in the chip for its assigned cache lines. The basic L1 and L2 cache parameters
are summarized in Table 2-4. Two unusual fields in this table are the Duty Cycle and Ports designations, which are
specific only to the Intel® Xeon Phi™ coprocessor design. The L1 cache can be accessed each clock, whereas the L2 can
only be accessed every other clock. Additionally, on any given clock software can either read or write the L1 or L2, but it
cannot read and write in the same clock. This design artifact has implications when software is trying to access a cache
while evictions are taking place.

Table 2-4. Cache Hierarchy


Parameter       L1               L2
Coherence       MESI             MESI
Size            32 KB + 32 KB    512 KB
Associativity   8-way            8-way
Line Size       64 bytes         64 bytes
Banks           8                8
Access Time     1 cycle          11 cycles
Policy          pseudo LRU       pseudo LRU
Duty Cycle      1 per clock      1 per clock
Ports           Read or Write    Read or Write


The L2 cache organization per core is inclusive of the L1 data and instruction caches. How all cores work together to
make a large, shared, L2 global cache (up to 31 MB) may not be clear at first glance. Since each core contributes 512 KB
of L2 to the total shared cache storage, it may appear as though a maximum of 31 MB of common L2 cache is available.
However, if two or more cores are sharing data, the shared data is replicated among the individual cores' various L2
caches. That is, if no cores share any data or code, then the effective total L2 size of the chip is 31 MB, whereas if every
core shares exactly the same code and data in perfect synchronization, then the effective total L2 size of the chip is only
512 KB. The actual size of the workload-perceived L2 storage is a function of the degree of code and data sharing among
cores and threads.


A simplified way to view the many cores in Intel® Xeon Phi™ coprocessor is as a chip-level symmetric multiprocessor 
(SMP). Each core acts as a stand-alone core with 512 KB of total cache space, and up to 62 such cores share a high-speed 
interconnect on-die. While not particularly accurate compared to a real SMP implementation, this simple mental model 
is useful when considering the question of how much total L2 capacity may be used by a given workload on the Intel® 
Xeon Phi™ coprocessor card. 
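
This mental model can be turned into a rough estimate. The sketch below assumes a deliberately simplified workload in which a fixed fraction of every core's L2 holds data replicated in all cores and the remainder holds data private to that core. It uses 62 cores by default because 62 x 512 KB matches the 31 MB figure quoted above. The function and parameter names are hypothetical.

```python
def effective_l2_kb(cores=62, shared_fraction=0.0, l2_kb_per_core=512):
    """Estimate of unique L2 contents across the chip, in KB.

    shared_fraction is the portion of each core's L2 occupied by lines that
    are replicated in every core; replicated lines count only once.
    """
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be between 0 and 1")
    shared_kb = shared_fraction * l2_kb_per_core
    private_kb = (1.0 - shared_fraction) * l2_kb_per_core
    return shared_kb + cores * private_kb


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, effective_l2_kb(shared_fraction=fraction) / 1024, "MB")
```

Real workloads share data unevenly between cores, so this gives only the two end points and a straight line between them.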


2.1.4 Page Tables 


The Intel® Xeon Phi™ coprocessor supports 32-bit physical addresses in 32-bit mode, 36-bit physical address extension 
(PAE) in 32-bit mode, and 40-bit physical address in 64-bit mode. 


It supports 4-KB and 2-MB page sizes. It also supports the Execute Disable (NX) bit. Unlike other Intel® Architecture
microprocessors, however, it does not support the Global Page bit. On a TLB miss, a four-level page table walk is performed as
usual, and the INVLPG instruction works as expected. The advantage of this approach is that there are no restrictions on
mixing the page sizes (4 KB, 2 MB) within a single address block (2 MB). However, undefined behavior will occur if the 16
underlying 4-KB page-table entries are not consistent.


Each L1 data TLB (dTLB) has 64 entries for 4-KB pages and 8 entries for 2-MB pages. Each core also has one instruction
TLB (iTLB), which has only 32 entries for 4-KB pages. No support for larger page sizes is present in the instruction TLB. For
L2, the 4-way dTLB has 64 entries, usable as a second-level TLB for 2-MB pages or as a page directory entry (PDE) cache for
4-KB pages. TLBs can share entries among threads that have the same values for the following registers: CR3, CR0.PG, CR4.PAE,
CR4.PSE, EFER.LMA.

Table 2-5. L1 and L2 Caches Characteristics


TLB                  Page Size   Entries   Associativity   Maps
L1 Data TLB          4K          64        4-way           256K
L1 Data TLB          2M          8         4-way           16M
L1 Instruction TLB   4K          32        4-way           128K
L2 TLB               4K, 2M      64        4-way           128M

The Intel® Xeon Phi™ coprocessor core implements two types of memory: uncacheable (UC) and write-back (WB). The 
other three memory forms [write-through (WT), write-combining (WC), and write-protect (WP)] are mapped internally 
to microcontroller behavior. No other memory type is legal or supported. 


2.1.5 Hardware Threads and Multithreading 


Figure 2-6 presents a high-level view of the major impacts of hardware multithreading support, such as architectural,
pipeline, and cache interactions. This includes replicating the complete architectural state four times: the GPRs, ST0-7, segment
registers, CR, DR, EFLAGS, and EIP. Certain micro-architectural states are also replicated four times, such as the prefetch
buffers, the instruction pointers, the segment descriptors, and the exception logic. "Thread-specific" changes include
adding thread ID bits to shared structures (iTLB, dTLB, BTB), converting memory stall to thread-specific flush, and the
introduction of thread wakeup/sleep mechanisms through microcode and hardware support. Finally, the Intel® Xeon
Phi™ coprocessor implements "smart" round-robin multithreading.

Figure 2-6. Multithreading Architectural Support in the Intel® Xeon Phi™ Coprocessor


Each of the four hardware threads shown above in the grey shaded region has a "ready to run" buffer consisting of two
instruction bundles. Since each core is capable of issuing two instructions per clock cycle, each bundle represents two
instructions. If the executing thread has a control transfer to a target that is not contained in this buffer, it will trigger a
miss to the instruction cache, which flushes the context buffer and loads the appropriate target instructions. If the
instruction cache does not have the control transfer point, a core stall will be initiated, which may result in performance
penalties. In general, whichever hardware context issues instructions in a given clock cycle has priority for fetching the
next instruction(s) from the instruction cache. Another significant function is the picker function (PF) that chooses the
next hardware context to execute. The PF behaves in a round-robin manner, issuing instructions during any one clock
cycle from the same hardware context only. In cycle N, if the PF issues instruction(s) from Context 3, then in cycle N+1
the PF will try to issue instructions from Context 0, Context 1, or Context 2, in that order. As previously noted, it is not
possible to issue instructions from the same context (Context 3 in this example) in back-to-back cycles.
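
The selection order described for the picker function can be sketched as follows. This is an illustrative model only: it assumes the set of contexts that are ready to issue is known each cycle and ignores the "smart" aspects of the real picker, which the text does not detail. The function name and the `ready` parameter are hypothetical.

```python
def pick_next_context(last_issued, ready, num_contexts=4):
    """Return the next context to issue from, or None if none is eligible.

    Contexts are tried in round-robin order starting after last_issued.
    The context that issued in the previous cycle is never selected.
    """
    for step in range(1, num_contexts):
        candidate = (last_issued + step) % num_contexts
        if candidate in ready:
            return candidate
    return None


if __name__ == "__main__":
    print(pick_next_context(3, {0, 1, 2}))
    print(pick_next_context(3, {2}))
    print(pick_next_context(3, {3}))
```

The last call returns None because only Context 3 is ready and it issued in the previous cycle, so the issue slot is lost.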


2.1.6 Faults and Breakpoints 


The Intel® Xeon Phi™ coprocessor supports the fault types shown in Table 2-6 below. For complete details of fault
behavior, please consult the Intel® 64 and IA-32 Architectures Software Developer Manuals.


Breakpoint support required the widening of DR0-DR3 for Intel® 64 instruction compatibility and is now for 1, 2, 4, or 8
bytes. The length was not extended to support 16, 32, or 64 bytes. Also, breakpoints in the Intel® Xeon Phi™ coprocessor
instructions occur regardless of any conditional execution status indicated by mask registers.


Table 2-6. Supported and Unsupported Faults on Intel® Xeon Phi™ Coprocessor


Fault Type   Supported   Comments
#PF          Yes         Page Fault
#SS          Yes         For non-canonical and referencing SS segment
#GP          Yes         Address is not canonical or not aligned to operand size
#UD          Yes         If CR0.EM[2] = 1, or LOCK or REX prefix used; also triggered
                         on IN or OUT instructions
#XF          No          No unmasked exceptions in SIMD
#AC          No          GP fault always takes priority
#NM          No          CR0.TS[3] = 1


2.1.7 Performance Monitoring Unit and Events Monitor 


The Intel? Xeon Phi™ coprocessor includes a performance monitoring unit (abbreviated as PMU) like the original Intel® 
Pentium? processor core. Most of the 42 event types from the original Intel? Pentium? processor exist, although the 
PMU interface has been updated to reflect more recent programming interfaces. Particular Intel? Xeon Phi™ 
coprocessor-centric events have been added to measure memory controller events, vector processing unit utilization 
and statistics, local and remote cache read/write statistics, and more. 


The Intel® Xeon Phi™ coprocessor comes with support for performance monitoring at the individual core level. Each core
has four performance counters, four filtered counters, and four event select registers. The events supported for
performance monitoring are a combination of the legacy Intel® Pentium® processor events and new Intel® Xeon Phi™
coprocessor-centric events. Each core PMU is shared amongst all four hardware threads in that core. The PMU in each
core is responsible for maintaining the time stamp counter (TSC) and counts hardware events generated by the core or
triggered by events arriving at the core. By default, events are counted for all hardware contexts in the core, but the
filters may be set to count only specific hardware context events. The core PMU also receives events from co-located
units, including the ring stop, the distributed tag directory, and the core-ring interface.


The Intel® Xeon Phi™ coprocessor switched to the Intel® Pentium® Pro processor style of PMU interface, which allows
user-space (Ring 3) applications to directly interface with and use the PMU features via specialized instructions such
as RDPMC. In this model, Ring 0 still controls the PMU but Ring 3 is capable of interacting with exposed features for
optimization.


Table 2-7 lists the instructions used by Ring 0 and Ring 3 code to control and query the core PMU.


Table 2-7. Core PMU Instructions


RDMSR
    Description: Read model specific register. Used by Ring 0 code to read any core PMU register.
    Input:       ECX: Address of MSR
    Output:      EDX:EAX = 64-bit MSR value

WRMSR
    Description: Write model specific register. Used by Ring 0 code to write to any core PMU register.
    Input:       EDX:EAX = 64-bit MSR value; ECX: Address of MSR

RDTSC
    Description: Read timestamp counter. Reads the current timestamp counter value.
    Output:      EDX:EAX = 64-bit timestamp value

RDPMC
    Description: Read performance-monitoring counter. Reads the counts of any of the performance monitoring
                 counters, including the PMU filtered counters.
    Input:       ECX: Counter # (0x0: IA32_PerfCntr0, 0x1: IA32_PerfCntr1)
    Output:      EDX:EAX = Zero-extended 40-bit counter value

SPFLT
    Description: Set user preference flag to indicate counter enable/disable.
    Input:       Any GPR[0] (0x0: Clear (disable), 0x1: Set (enable))
    Output:      Set/clear USER_PREF bit in PERF_SPFLT_CONTROL.


The instructions RDMSR, WRMSR, RDTSC, and RDPMC are well-documented in the (Intel® 64 and IA-32 Architectures
Software Developer Manuals). The only Intel® MIC Architecture-specific notes are that RDTSC has been enhanced to
execute in 4-5 clock cycles and that a mechanism has been implemented to synchronize timestamp counters across the
chip.
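

As an illustration of the RDPMC result format in Table 2-7, the following Python sketch models how software might
combine the EDX:EAX register pair into a 40-bit counter value and compute the difference between two readings. The
function names are hypothetical, and the delta calculation assumes the counter wrapped at most once between readings.

```python
COUNTER_BITS = 40                       # counter width given in Table 2-7
COUNTER_MASK = (1 << COUNTER_BITS) - 1  # 0xFF_FFFF_FFFF


def counter_value(edx: int, eax: int) -> int:
    """Combine the EDX:EAX pair returned by RDPMC into one counter value."""
    raw = ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
    return raw & COUNTER_MASK           # the upper 24 bits are zero-extended


def counter_delta(start: int, end: int) -> int:
    """Events counted between two readings (assumes at most one wraparound)."""
    return (end - start) & COUNTER_MASK
```

Masking the difference rather than subtracting directly keeps the result correct when the second reading is
numerically smaller than the first because the counter wrapped.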


SPFLT is unique because it allows software threads fine-grained control in enabling/disabling the performance counters. 
The anticipated usage model for this instruction is for instrumented code to enable/disable counters around desired 
portions of code. Note that software can only specify its preference for enabling/disabling counters and does not have 
control over which specific counters are affected (this behavior supports virtualization). The SPFLT instruction can only 
be executed while the processor is in Intel® 64-bit mode.


Table 2-8 lists the model-specific registers used to program the operation of the core PMU. 


Table 2-8. Core PMU Control Registers


MSR_TIME_STAMP_COUNTER
    Address:     0x10 (16)
    Description: Timestamp Counter
    Width:       64

MSR_PerfCntr0
    Address:     0x20 (32)
    Description: Events Counted, core PMU counter 0
    Threaded?:   Yes
    Width:       40

MSR_PerfCntr1
    Address:     0x21 (33)
    Description: Events Counted, core PMU counter 1
    Threaded?:   Yes
    Width:       40

MSR_PerfEvtSel0
    Address:     0x28 (40)
    Description: Performance Event Selection and configuration register for IA32_PerfCntr0.
    Threaded?:   Yes
    Width:       32

MSR_PerfEvtSel1
    Address:     0x29 (41)
    Description: Performance Event Selection and configuration register for IA32_PerfCntr1.
    Threaded?:   Yes

MSR_PERF_SPFLT_CONTROL
    Description: SPFLT Control Register. This MSR controls the effect of the SPFLT instruction and whether it will
                 allow software fine-grained control to enable/disable IA32_PerfCntrN.

MSR_PERF_GLOBAL_STATUS
    Address:     0x2D (45)
    Description: Counter Overflow Status. This read-only MSR displays the overflow status of all the counters.
                 Each bit is implemented as a sticky bit, set by a counter overflow.
    Threaded?:   Yes

MSR_PERF_GLOBAL_OVF_CTRL
    Description: Counter Overflow Control. This write-only MSR clears the overflow indications in the Counter
                 Overflow Status register. For each bit that is set, the corresponding overflow status is cleared.
    Threaded?:   Yes

Master PMU Enable
    Description: Global PMU enable / disable. When these bits are set, the core PMU is permitted to count events as
                 configured by each of the Performance Event Selection registers (which can each be independently
                 enabled or disabled). When these bits are cleared, performance monitoring is disabled. The
                 operation of the Timestamp Counter is not affected by this register.
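

The interaction between the Counter Overflow Status and Counter Overflow Control registers can be pictured with a
small software model. This is only a sketch: the class and method names are hypothetical, and the mapping of bit n to
counter n is an assumption made for illustration, not a layout taken from this guide.

```python
class OverflowStatusModel:
    """Toy model of sticky overflow bits cleared through a write-only control."""

    def __init__(self) -> None:
        self.status = 0  # models the read-only Counter Overflow Status value

    def record_overflow(self, counter: int) -> None:
        # A counter overflow sets its sticky bit; the bit stays set until cleared.
        self.status |= 1 << counter  # assumption: bit n corresponds to counter n

    def write_overflow_control(self, value: int) -> None:
        # For each bit set in the written value, clear the matching status bit.
        self.status &= ~value
```

In this model, reading `status` after an overflow keeps returning the same bit until software writes the matching
bit to the control side, which is the behavior the sticky-bit description implies.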


For a description of the complete set of Intel® Xeon Phi™ coprocessor PMU and EMON registers and its performance
monitoring facilities, please see the document (Intel® Xeon Phi™ Coprocessor Performance Monitoring Units, Document
Number: 327357-001, 2012).


2.1.7.1 Timestamp Counter (TSC) 


The RDTSC instruction that is used to access IA32_TIMESTAMP_COUNTER can be enabled for Ring 3 (user code) by
setting CR4[2].


This behavior enables software (including user code) to use IA32_TIMESTAMP_COUNTER as a wall clock timer. The Intel®
Xeon Phi™ coprocessor only supports this behavior in a limited configuration (P1 only) and not across different P-states.
The Intel® Xeon Phi™ coprocessor will increment IA32_TIMESTAMP_COUNTER based on the current core frequency, but
it will not scale such MSRs across package C-states.


For Intel® Xeon Phi™ coprocessor performance analysis, the IA32_TIMESTAMP_COUNTER feature always works on P1
and standard behavior is expected so that any new or pre-existing code using RDTSC will obtain consistent results.
However, P-states and package C-states must be disabled during fine-grained performance analysis.
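

Because the counter increments at the core frequency, converting two RDTSC readings into wall-clock time is a single
division, provided the frequency stayed fixed between the readings. The sketch below shows the arithmetic; the function
name is hypothetical and the 1.1 GHz figure in the usage note is an assumed example value, not a specification.

```python
TSC_MASK = (1 << 64) - 1  # IA32_TIMESTAMP_COUNTER is 64 bits wide


def tsc_elapsed_seconds(tsc_start: int, tsc_end: int, core_hz: int) -> float:
    """Elapsed time between two TSC readings taken at a fixed core frequency."""
    if core_hz <= 0:
        raise ValueError("core frequency must be positive")
    cycles = (tsc_end - tsc_start) & TSC_MASK
    return cycles / core_hz
```

For example, with an assumed core frequency of 1,100,000,000 Hz, a difference of 1,100,000,000 counts corresponds to
one second. The result is only meaningful when P-states and package C-states are disabled, as noted above.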


2.1.8 System Interface 


The System Interface consists of two major units: the Intel® Xeon Phi™ coprocessor System Interface (SI) and the
Transaction Control Unit (TCU). The SI contains all of the PCI Express* logic, which includes the PCI Express* protocol
engine, SPI for flash and coprocessor OS loading, I2C for fan control, and the APIC logic. The TCU bridges the coprocessor
SI to the Intel® Xeon Phi™ coprocessor internal ring, and contains the hardware support for DMA and buffering with
transaction control flow. This block includes the DMA controllers, the encryption/decryption engine, MMIO registers,
and various flow-control queuing instructions that provide the internal interface to the ring transaction protocol engine.


2.1.8.1 PCI Express 


The Intel® Xeon Phi™ coprocessor card complies with the PCI Express* Gen2 x16 specification and supports 64- to
256-byte packets. PCI Express* peer-to-peer writes and reads are also supported.


The following registers show the Intel® Xeon Phi™ coprocessor PCI Express configuration settings:


• PCIE_PCIE_CAPABILITY Register (SBOX MMIO offset 0x584C)


Bits  | Type | Reset | Description
23:20 | RO   | 0x0   | Device/Port Type
other bits unmodified


• PCIE_BAR_ENABLE Register (SBOX MMIO offset 0x5CD4)


Bits | Type | Reset | Description
0    | RW   | 1     | MEMBAR0 (Aperture) Enable
1    | RW   | 1     | MEMBAR1 (MMIO Registers) Enable
2    | RW   | 0     | I/O BAR Enable
3    | RW   | 0     | EXPROM BAR Enable
31:4 | Rsvd | 0     |
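

The PCIE_BAR_ENABLE bit layout above can be decoded with a few shifts and masks. The following sketch is a
hypothetical helper written for illustration; only the bit positions and reset values come from the table.

```python
# Bit position -> field name, taken from the PCIE_BAR_ENABLE table.
BAR_ENABLE_FIELDS = {
    0: "MEMBAR0 (Aperture)",
    1: "MEMBAR1 (MMIO Registers)",
    2: "I/O BAR",
    3: "EXPROM BAR",
}

# Reset state from the table: bits 0 and 1 set, bits 2 and 3 clear.
BAR_ENABLE_RESET = 0b0011


def decode_bar_enable(value: int) -> dict:
    """Return which BARs are enabled for a 32-bit PCIE_BAR_ENABLE value."""
    return {name: bool((value >> bit) & 1) for bit, name in BAR_ENABLE_FIELDS.items()}
```

Decoding `BAR_ENABLE_RESET` reports both memory BARs enabled and the I/O and expansion ROM BARs disabled, matching
the reset column.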


2.1.8.2 Memory Controller 


There are 8 on-die GDDR5-based memory controllers in the Intel® Xeon Phi™ coprocessor. Each can operate two 32-bit
channels for a total of 16 memory channels that are capable of delivering up to 5.5 GT/s per channel. The memory
controllers directly interface to the ring interconnect at full speed, receiving complete physical addresses with each
request. Each memory controller is responsible for reading data from and writing data to GDDR memory, translating the
memory read and write requests into GDDR commands. All the requests coming from the ring interface are scheduled by
taking into account the timing restrictions of the GDDR memory and its physical organization to maximize the effective
bandwidth that can be obtained from the GDDR memory. The memory controller guarantees a bounded latency for special
requests arriving from the SBOX. The bandwidth guaranteed to the SBOX is 2 GB/s. The MBOX communicates with the FBOX
(the ring interface) and the PBOX (the physical interface to the GDDR). The MBOX is also responsible for issuing all the
refresh commands to the GDDR.
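

The channel figures above imply a theoretical peak memory bandwidth, which the following sketch works out. The function
name is hypothetical, and the result is an upper bound derived purely from the stated channel count, width, and transfer
rate; achievable bandwidth depends on GDDR timing restrictions and access patterns.

```python
def peak_gddr_bandwidth_gb_per_s(controllers: int = 8,
                                 channels_per_controller: int = 2,
                                 channel_width_bits: int = 32,
                                 transfers_gt_per_s: float = 5.5) -> float:
    """Theoretical peak bandwidth in GB/s from the figures quoted in the text."""
    channels = controllers * channels_per_controller  # 16 channels in total
    bytes_per_transfer = channel_width_bits // 8      # 4 bytes per channel transfer
    return channels * bytes_per_transfer * transfers_gt_per_s
```

With the default arguments this evaluates 16 channels x 4 bytes x 5.5 GT/s = 352 GB/s.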


The GDDR5 interface supports two major data integrity features: Parity on the Command/Address interface and an 
optional software-based ECC for data. 


2.1.8.2.1 DMA Capabilities 


Direct Memory Access (DMA) is a common hardware function within a computer system that is used to relieve the CPU 
from the burden of copying large blocks of data. To move a block of data, the CPU constructs and fills a buffer, if one 
doesn’t already exist, and then writes a descriptor into the DMA Channel’s Descriptor Ring. A descriptor describes details 
such as the source and target memory addresses and the length of data in cache lines. The following data transfers are 
supported: 


• Intel® Xeon Phi™ coprocessor to Intel® Xeon Phi™ coprocessor GDDR5 space (aperture)
• Intel® Xeon Phi™ coprocessor GDDR5 to host System Memory
• Host System Memory to Intel® Xeon Phi™ coprocessor GDDR5 (aperture or non-aperture)
• Intra-GDDR5 Block Transfers within Intel® Xeon Phi™ coprocessor


A DMA Descriptor Ring is programmed by either the coprocessor OS or the Host Driver. Up to eight Descriptor Rings can
be opened by software; each is referred to as a DMA Channel. The Host Driver and the coprocessor OS open DMA
Channels in system memory and GDDR5 memory respectively; that is, all descriptor rings owned by the host driver must
exist in system memory while rings owned by the coprocessor OS must exist in GDDR5 memory. A programmable arbitration
scheme resolves access conflicts when multiple DMA Channels vie for system or Intel® Xeon Phi™ coprocessor
resources.


The Intel® Xeon Phi™ coprocessor supports host-initiated or device-initiated PCI Express* Gen2/Gen1 memory, I/O, and
configuration transactions. The Intel® Xeon Phi™ coprocessor device-initiated memory transactions can be generated
either from execution cores directly or by using the DMA engine in the SBOX.


In summary, the DMA controller has the following capabilities: 


• 8 DMA channels operating simultaneously, each with its own independent hardware ring buffer that can live in
  either local or system memory
• Supports transfers in either direction (host / Intel® Xeon Phi™ coprocessor devices)
• Supports transfers initiated by either side
• Always transfers using physical addresses
• Interrupt generation upon completion
• 64-byte granularity for alignment and size
• Writing completion tags to either local or system memory


The DMA block operates at the core clock frequency. There are 8 independent channels which can move data: 


• From GDDR5 Memory to System Memory
• From System Memory to GDDR5 Memory
• From GDDR5 Memory to GDDR5 Memory


The Intel® Xeon Phi™ coprocessor supports not only 64 bytes (1 cache line) per PCI Express* transaction, but up to a
maximum of 256 bytes for each DMA-initiated transaction. This requires that the Root Complex support 256-byte
transactions. Programming MAX_PAYLOAD_SIZE in the PCI_COMMAND_STATUS register sets the actual size of each
transaction.
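

To make the effect of the payload size concrete, the sketch below counts how many PCI Express* transactions a DMA
transfer would need at a given maximum payload. The helper names are hypothetical, and the count assumes every
transaction carries a full payload except possibly the last, which is a simplification.

```python
CACHE_LINE_BYTES = 64  # DMA alignment and size granularity stated above


def is_valid_dma_transfer(address: int, length: int) -> bool:
    """Check the 64-byte granularity rule for both alignment and size."""
    return length > 0 and address % CACHE_LINE_BYTES == 0 and length % CACHE_LINE_BYTES == 0


def pcie_transaction_count(length: int, max_payload: int = 256) -> int:
    """Transactions needed to move `length` bytes at the given maximum payload."""
    if max_payload not in (64, 128, 256):
        raise ValueError("payload must be between 64 and 256 bytes")
    if length <= 0 or length % CACHE_LINE_BYTES:
        raise ValueError("length must be a positive multiple of 64 bytes")
    return -(-length // max_payload)  # ceiling division
```

Under these assumptions a 4096-byte transfer takes 64 transactions at a 64-byte payload and 16 transactions at a
256-byte payload.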


2.1.8.2.1.1 DMA Channel Arbitration 


There is no notion of priority between descriptors within a DMA Channel; descriptors are fetched, and operated on, in a 
sequential order. Priority between descriptors is resolved by opening multiple DMA channels and performing arbitration 
between DMA channels in a round-robin fashion. 
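

The arbitration behavior described above can be modeled in a few lines. This is a software illustration only, with
hypothetical names; it assumes one descriptor is serviced per channel per turn, which the text does not specify.

```python
from collections import deque


def round_robin_order(channels: dict) -> list:
    """Return the order in which descriptors are serviced.

    `channels` maps a channel number to its descriptors in ring order.
    Within a channel, descriptors stay strictly sequential; across
    channels, the arbiter visits each non-empty channel in turn.
    """
    queues = {ch: deque(descs) for ch, descs in channels.items()}
    order = []
    while any(queues.values()):
        for ch in sorted(queues):
            if queues[ch]:
                order.append((ch, queues[ch].popleft()))
    return order
```

For instance, `round_robin_order({0: ["a", "b", "c"], 1: ["x"]})` services `a`, then `x`, then `b` and `c`: placing
urgent work in its own channel lets it bypass a long queue in another channel.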


2.1.8.2.1.2 Descriptor Ring Overview 


A Descriptor Ring is a circular buffer as shown in Figure 2-7. The length of a Descriptor Ring can be up to 128K entries,
and must align to the nearest cache line boundary. Software manages the ring by advancing a Head Pointer as it fills the
ring with descriptors. When the descriptors have been copied, it writes this updated Head Pointer into the DMA Head
Pointer Register (DHPR0 through DHPR7) for the appropriate DMA Channel. Each DMA Channel contains a Tail Pointer that
advances as descriptors are fetched into a channel's Local Descriptor Queue. The Descriptor Queue is 64 entries, and can
be thought of as a sliding window over the Descriptor Ring. The Tail Pointer is periodically written back to memory so
that software can track its progress. Upon initialization, software sets both the Head Pointer and Tail Pointer to point to
the base of the Descriptor Ring. From the DMA Channel perspective, an empty state is approached when the Tail Pointer
approaches the Head Pointer. From a software perspective, a full condition is approached when the Head Pointer
approaches the Tail Pointer.
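

The head and tail bookkeeping described above follows the usual circular-buffer pattern, sketched below as a software
model. The class and method names are hypothetical. Keeping one slot unused to tell a full ring from an empty one is a
common convention assumed here; the text does not state how the hardware distinguishes the two conditions.

```python
LOCAL_QUEUE_ENTRIES = 64  # size of the channel's Local Descriptor Queue


class DescriptorRingModel:
    """Software model of a descriptor ring with head and tail indexes."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.head = 0  # advanced by software as it posts descriptors
        self.tail = 0  # advanced by the DMA channel as it fetches them

    def pending(self) -> int:
        return (self.head - self.tail) % self.size

    def free_slots(self) -> int:
        return self.size - 1 - self.pending()  # one slot kept unused (assumption)

    def post(self, count: int) -> None:
        if count > self.free_slots():
            raise BufferError("descriptor ring is full")
        self.head = (self.head + count) % self.size

    def fetch(self) -> int:
        count = min(LOCAL_QUEUE_ENTRIES, self.pending())
        self.tail = (self.tail + count) % self.size
        return count
```

In this model the ring is empty when the tail has caught up with the head, and software must stop posting when the
head is about to catch up with the tail.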


The Head and Tail Pointers are 40-bit Intel® Xeon Phi™ coprocessor addresses. If the high-order bit is a 1, the
descriptors reside in system memory; otherwise they reside in the Intel® Xeon Phi™ coprocessor memory. Descriptors
come in five different formats and are 16 bytes in length. There are no alignment restrictions when writing descriptors
into the ring. However, performance is optimized when descriptors start and end on cache line boundaries because
memory accesses are performed on cache line granularities, four descriptors at a time.


Figure 2-7. DMA Channel Descriptor Ring plus Local Descriptor Queue


2.1.8.2.1.3 Descriptor Ring Setup 


Figure 2-8 shows how the Descriptor Ring Attributes Register (DRAR) sets up the Descriptor Ring in each DMA channel.
Because a descriptor ring can vary in size, the Base Address (BA) represents a 36-bit index. The Tail Pointer Index is
concatenated to the BA field to form a Tail Pointer into the GDDR space. If the descriptor ring resides in system
memory, BA[35] and BA[34] will be truncated to correspond with the 16 GB system-memory page as shown in Figure 2-9.
The Sys bit must be set along with a valid system-memory page number.


Figure 2-8. Descriptor Ring Attributes


Figure 2-9. Intel® Xeon Phi™ Coprocessor Address Format


Because the size of the Descriptor Ring can vary, the Base Address must provide adequate space for concatenation of 
the Tail Pointer Index by zeroing out all the low-order bits that correspond to the size as shown in Figure 2-9. Table 2-9 
gives some examples of the base address ranges based on the size of the descriptor ring. 


Because the Head Pointer Index is updated by software, checks are made to determine if the index falls within the range 
specified by the size. An error will be generated if the range is exceeded. 




Table 2-9. Examples of Base Address Ranges Based on Descriptor Ring Size


Size          | Base Address Range            | Tail Pointer Range
0x0004 (4)    | 0x0 0000 0000 : 0xF FFFF FFFC | 0x0 0000 : 0x0003
0x0008 (8)    | 0x0 0000 0000 : 0xF FFFF FFF8 | 0x0 0000 : 0x0007
0x000C (12)   | 0x0 0000 0000 : 0xF FFFF FFF0 | 0x0 0000 : 0x000B
0x0010 (16)   | 0x0 0000 0000 : 0xF FFFF FFF0 | 0x0 0000 : 0x000F
0x0018 (24)   | 0x0 0000 0000 : 0xF FFFF FFE0 | 0x0 0000 : 0x0017
0x0100 (256)  | 0x0 0000 0000 : 0xF FFFF FF00 | 0x0 0000 : 0x00FF
0x0400 (1024) | 0x0 0000 0000 : 0xF FFFF FC00 | 0x0 0000 : 0x03FF
0x1000 (4096) | 0x0 0000 0000 : 0xF FFFF F000 | 0x0 0000 : 0x0FFF


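[Figure 2-10 shows the 36-bit base address field: bits 35 down to n+1 hold the Base Address (BA), and bits n down to 0
are all 0s, where n is determined by the Size of the descriptor ring.]


Figure 2-10. Base Address Width Variations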
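

The rows of Table 2-9 are consistent with a simple rule: round the ring size up to the next power of two and zero
that many low-order bits of the base address. The sketch below reproduces the table under that rule. The rounding rule
is inferred from the table rows for sizes 12 and 24 rather than stated in the text, and the function names are
hypothetical.

```python
BASE_ADDRESS_BITS = 36  # width of the Base Address (BA) field


def highest_base_address(ring_size: int) -> int:
    """Largest base address that leaves room for the Tail Pointer Index."""
    if ring_size < 1:
        raise ValueError("ring size must be at least 1")
    span = 1 << (ring_size - 1).bit_length()  # next power of two >= ring_size
    return ((1 << BASE_ADDRESS_BITS) - 1) & ~(span - 1)


def tail_pointer_range(ring_size: int) -> tuple:
    """Valid Tail Pointer Index values for a ring of the given size."""
    return (0, ring_size - 1)
```

For a ring of 24 descriptors the span rounds up to 32, so the five low-order bits of the base address are zero and
the highest usable base address is 0xF FFFF FFE0, as in the table.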


Figure 2-11 shows the Head and Tail Pointer index registers used to access the descriptor ring. Both pointers are indexes
into the descriptor ring relative to the base, not to Intel® Xeon Phi™ coprocessor addresses. Both indexes are on
descriptor boundaries and are the same width as the Size field in the DRAR. For the Tail Pointer Address, the DMA uses
the TPI along with the Sys bit, Page, and Base Address in the DRAR.


Head Pointer Index register:
    Bits 31:17    RESD
    Bits 16:0     Head Pointer Index (HPI)

Tail Pointer Index register:
    Bits 31:17    RESD
    Bits 16:0     Tail Pointer Index (TPI)


Figure 2-11. Head and Tail Pointer Index Registers


2.1.8.2.2 Interrupt Handling 


There are three different types of interrupt flows that are supported in the Intel® Xeon Phi™ coprocessor:

Local Interrupts: These are the interrupts that are destined for one (or more) of the Intel® Xeon Phi™ coprocessor cores
located on the originating device. They appear in the form of APIC messages on the APIC serial bus.

Remote Interrupts: These are the interrupts which are destined for one (or more) of the Intel® Xeon Phi™ coprocessor
cores in other Intel® Xeon Phi™ coprocessor devices. They appear as MMIO accesses on the PEG port.

System Interrupts: These are the interrupts which are destined for the host processor(s). They appear as
INTx/MSI/MSI-X messages on the PEG port, depending upon the PCI configuration settings.


2.1.8.2.3 Intel® Xeon Phi™ Coprocessor Memory Space


Table 2-10 lists the starting addresses assigned to specific functions. 


Table 2-10. Coprocessor Memory Map


Function       | Starting Address | Size (Bytes) | Comment
GDDR5 Memory   | 00 0000 0000     | Variable     |
System Memory  |                  | Variable     | Addresses translated through SMPT
Flash Memory   | 00 FFF8 5000     | 364K         | Actual size of flash varies. Some parts are not accessible through the normal memory path
MMIO Registers | 00 007D 0000     | 64K          | Accessibility from the host is limited
Boot ROM       | 00 FFFF 0000     | 64K          | New for Intel® Xeon Phi™ coprocessor. Overlays FBOOT0 image in flash
Fuse Block     | 00 FFF8 4000     | 4K           | New for Intel® Xeon Phi™ coprocessor memory space


2.1.8.2.3.1 Host-Visible Intel® Xeon Phi™ Coprocessor Memory Space


After Reset, all GDDR5 memory sits inside "stolen memory" (that is, memory not accessible by the Host). Stolen memory
(CP_MEM_BASE/TOP) has precedence over the PCI Express* aperture. FBOOT1 code will typically shrink stolen memory
or remove it. The aperture is programmed by the host or the coprocessor OS to create a flat memory space.


2.1.8.2.3.2 Intel® Xeon Phi™ Coprocessor Boot ROM


The Intel® Xeon Phi™ coprocessor software boot process is summarized below:


1. After Reset: Boot-Strap Processor (BSP) executes code directly from the 1st-Stage Boot-Loader Image (FBOOT0).
2. FBOOT0 authenticates 2nd-Stage Boot-Loader (FBOOT1) and jumps to FBOOT1.
3. FBOOT1 sets up/trains GDDR5 and basic memory map.
4. FBOOT1 tells host to upload coprocessor OS image to GDDR5.
5. FBOOT1 authenticates coprocessor OS image. If authentication fails, FBOOT1 locks out specific features.
6. FBOOT1 jumps to coprocessor OS.


2.1.8.2.3.3 SBOX MMIO Register Space 


The SBOX contains 666 MMIO (Memory-Mapped I/O) registers (12 K bytes) that are used for configuration, status and
debug of the SBOX and other parts of the Intel® Xeon Phi™ coprocessor. These are sometimes referred to as
CSRs and are not part of the PCI Express* configuration space. The SBOX MMIO space is located at 08 007D 0000h to
08 007D FFFFh in the Intel® Xeon Phi™ coprocessor memory space. These MMIO registers are not contiguous, but are
split between various functional blocks within the SBOX. The coprocessor OS can always access these registers, while
accessibility by the host is limited to a subset for security.
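

Given the SBOX MMIO base above, the coprocessor-side address of any SBOX register is the base plus the register's
MMIO offset. The sketch below shows the calculation with a bounds check against the 64 KB window; the function name is
hypothetical, and the offsets used in the usage note are the ones quoted for the PCI Express* registers earlier in this
section.

```python
SBOX_MMIO_BASE = 0x08_007D_0000  # start of the SBOX MMIO space
SBOX_MMIO_SIZE = 0x1_0000        # 64 KB window ending at 08 007D FFFFh


def sbox_register_address(offset: int) -> int:
    """Coprocessor physical address of the SBOX register at `offset`."""
    if not 0 <= offset < SBOX_MMIO_SIZE:
        raise ValueError("offset is outside the SBOX MMIO window")
    return SBOX_MMIO_BASE + offset
```

For example, PCIE_BAR_ENABLE at SBOX MMIO offset 0x5CD4 resolves to 08 007D 5CD4h, and PCIE_PCIE_CAPABILITY at offset
0x584C resolves to 08 007D 584Ch.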


2.1.9 VPU and Vector Architecture 


The Intel® Xeon Phi™ coprocessor has a new SIMD 512-bit wide VPU with a corresponding vector instruction set. The
VPU can be used to process 16 single-precision or 8 double-precision elements. There are 32 vector registers and 8 mask
registers with per-lane predicated execution. Prime (hint) instructions for scatter/gather are available. Load operations
come from 2-3 sources to 1 destination. There are new SP transcendental instructions supported in hardware for
exponent, logarithm, reciprocal, and square root operations. The VPUs are mostly IEEE 754-2008 floating-point
compliant with added SP, DP-denorm, and SAE support for IEEE compliance and improved performance on fdiv/sqrt.
Streaming stores (no read for ownership before write) are available with the vmovaps/pd.nr and vmovaps/pd.ngo
instructions.


Section 7 contains more detailed information on the vector architecture.


2.1.10 Intel® Xeon Phi™ Coprocessor Instructions


The Intel® Xeon Phi™ coprocessor instruction set includes new vector instructions that are an extension of the existing
Intel® 64 ISA. However, they do not support the Intel Architecture family of vector architecture models (MMX™
instructions, Intel® Streaming SIMD Extensions, or Intel® Advanced Vector Extensions).


The major features of the Intel® Xeon Phi™ coprocessor vector ISA extensions are:


• A new instruction repertoire specifically tailored to boost the performance of High Performance Computing (HPC)
  applications. The instructions provide native support for both float32 and int32 operations while providing a rich
  set of conversions for common high performance computing native data types. Additionally, the Intel® Xeon Phi™
  coprocessor ISA supports float64 arithmetic and int64 logic operations.
• There are 32 new vector registers. Each is 512 bits wide, capable of packing 16 32-bit elements or 8 64-bit elements
  of floating point or integer values. A large and uniform vector register file helps in generating high performance
  code and covering longer latencies.
• Ternary instructions with two sources and a different destination. There are also Fused Multiply and Add (FMA)
  instructions which are ternary with three sources, one of which is also the destination.
• Intel® Xeon Phi™ coprocessor instructions introduce 8 vector mask registers that allow conditional execution over
  the 16 elements in a vector instruction and merged results to the original destination. Masks allow vectorizing
  loops that contain conditional statements. Additionally, support is provided for updating the value of the vector
  masks with special vector instructions such as vcmpps.
• The vector architecture supports a coherent memory model wherein the new set of instructions operates in the
  same memory address space as the standard Intel® 64 instructions. This feature eases the process of developing
  vector code.
• Specific gather/scatter instructions manipulate irregular data patterns in memory (by fetching sparse locations of
  memory into a dense vector register or vice-versa) thus enabling vectorization of algorithms with complex data
  structures.
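

The mask-register behavior listed above (conditional execution per element, with results merged into the original
destination) can be illustrated with a scalar model. The sketch below is plain Python, not coprocessor code; the
function names are hypothetical and the compare only stands in for the role a mask-producing instruction such as vcmpps
plays.

```python
VECTOR_LANES = 16  # 32-bit elements in one 512-bit vector register


def compare_less_than(a: list, b: list) -> int:
    """Build a 16-bit mask with bit i set where a[i] < b[i]."""
    return sum(1 << i for i, (x, y) in enumerate(zip(a, b)) if x < y)


def masked_add(dest: list, a: list, b: list, mask: int) -> list:
    """Add a and b in lanes whose mask bit is set; keep dest elsewhere."""
    return [x + y if (mask >> i) & 1 else d
            for i, (d, x, y) in enumerate(zip(dest, a, b))]
```

Together the two steps vectorize a loop of the form "if a[i] < b[i] then dest[i] = a[i] + b[i]": the compare produces
the mask once, and the masked add updates only the selected lanes while the others retain their previous values.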


Consult the (Intel® Xeon Phi™ Coprocessor Instruction Set Architecture Reference Manual (Reference Number: 327364))
for complete details on the Intel® Xeon Phi™ coprocessor instructions.


2.1.11 Multi-Card 


Each Intel® Xeon Phi™ coprocessor device is treated as an independent computing environment. The host OS
enumerates all the cards in the system at boot time and launches separate instances of the coprocessor OS and the SCIF
driver. See the SCIF documentation for more details about intercard communication.


2.1.12 Host and Intel® MIC Architecture Physical Memory Map


Figure 2-12. Host and Intel® MIC Architecture Physical Memory Map


The Intel® Xeon Phi™ coprocessor memory space supports 40-bit physical addresses, which translates into 1024 GiB of
addressable memory space that is split into 3 high-level ranges:


• Local address range: 0x00_0000_0000 to 0x0F_FFFF_FFFF (64 GiB)
• Reserved: 0x10_0000_0000 to 0x7F_FFFF_FFFF (448 GiB)
• System (Host) address range: 0x80_0000_0000 to 0xFF_FFFF_FFFF (512 GiB)


The Local Address Range 0x00_0000_0000 to 0x0F_FFFF_FFFF (64 GiB) is further divided into 4 equal size ranges:


• 0x00_0000_0000 to 0x03_FFFF_FFFF (16 GiB)
  ▪ GDDR (Low) Memory
  ▪ Local APIC Range (relocatable) 0x00_FEE0_0000 to 0x00_FEE0_0FFF (4 kB)
  ▪ Boot Code (Flash) and Fuse (via SBOX) 0x00_FF00_0000 to 0x00_FFFF_FFFF (16 MB)
• 0x04_0000_0000 to 0x07_FFFF_FFFF (16 GB)
  ▪ GDDR Memory (up to PHY_GDDR_TOP)
• 0x08_0000_0000 to 0x0B_FFFF_FFFF (16 GB)
  ▪ Memory mapped registers
    - DBOX registers 0x08_007C_0000 to 0x08_007C_FFFF (64 kB)
    - SBOX registers 0x08_007D_0000 to 0x08_007D_FFFF (64 kB)
• Reserved 0x0C_0000_0000 to 0x0F_FFFF_FFFF (16 GB)


The System address range 0x80_0000_0000 to 0xFF_FFFF_FFFF (512 GB) contains 32 pages of 16 GB each:


• Sys 0: 0x80_0000_0000 to 0x83_FFFF_FFFF (16 GB)
• Sys 1: 0x84_0000_0000 to 0x87_FFFF_FFFF (16 GB)
• ...
• Sys 31: 0xFC_0000_0000 to 0xFF_FFFF_FFFF (16 GB)
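

The address ranges listed above can be turned into a small classifier that reports which high-level range a 40-bit
address falls in and, for system addresses, which 16 GB page it selects. This is an illustrative sketch with a
hypothetical function name; it covers only the range boundaries given in this section and does not model the SMPT
translation itself.

```python
LOCAL_TOP = 0x0F_FFFF_FFFF    # end of the local address range
SYSTEM_BASE = 0x80_0000_0000  # start of the System (Host) address range
SYSTEM_PAGE_SHIFT = 34        # 16 GB pages: 2**34 bytes each


def classify_address(address: int) -> tuple:
    """Return (range name, system page number or None) for a 40-bit address."""
    if not 0 <= address < (1 << 40):
        raise ValueError("address does not fit in 40 bits")
    if address <= LOCAL_TOP:
        return ("local", None)
    if address < SYSTEM_BASE:
        return ("reserved", None)
    return ("system", (address - SYSTEM_BASE) >> SYSTEM_PAGE_SHIFT)
```

The system base has the high-order address bit set, so the test on `SYSTEM_BASE` is equivalent to checking bit 39,
and the page number is simply the next five address bits.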


These are used to access System Physical Memory addresses and can access up to 512 GiB at any given time. Remote
Intel® Xeon Phi™ coprocessor devices are also accessed through System addresses. All requests over PCI Express* to
the Host are generated through this range. A System Memory Page Table (SMPT) expands the 40-bit local address to a
64-bit System address.


Accesses to host memory are snooped by the host if the No-snoop bit in the SMPT register is not set. The SCIF driver
(see Sections 2.2.5.1 and 5.1) does not set this bit, so accesses to host memory are always snooped. Host accesses to
Intel® Xeon Phi™ coprocessor memory are snooped if they target cacheable memory.


The System (Host) address map of Intel® Xeon Phi™ coprocessor memory is represented by two base address registers:


• MEMBAR0
  ▪ Relocatable in 64-bit System Physical Memory Address space
  ▪ Prefetchable
  ▪ 32 GiB (max) down to 256 MiB (min)
  ▪ Programmable in Flash
  ▪ Offset into Intel® Xeon Phi™ coprocessor Physical Memory Address space
  ▪ Programmable in APR_PHY_BASE register
  ▪ Default is 0
• MEMBAR1
  ▪ Relocatable in 64-bit System Physical Memory Address space
  ▪ Non-prefetchable
  ▪ 128 KiB
  ▪ Covers DBOX0 & SBOX Memory-mapped registers
  ▪ DBOX at offset 0x0_0000
  ▪ SBOX at offset 0x1_0000

2.1.13 Power Management 


Intel® Xeon Phi™ coprocessor power management supports Turbo Mode and other P-states. Turbo Mode is an
opportunistic capability that allows the CPU to take advantage of thermal and power delivery headroom to increase the
operating frequency and voltage, depending on the number of active cores. Unlike the multicore family of Intel® Xeon®
processors, there is no hardware-level power control unit (PCU); power management is controlled by the coprocessor
OS. Please see Section 3.1 for more information on the power management scheme.


Below is a short description of the different operating modes and power states. For additional details, see the "Intel®
Xeon Phi™ Coprocessor Datasheet," Document Number 488073.


• Core C1 State: Core and VPU are clock gated (all 4 threads have halted)
• Core C6 State: Core and VPU are power gated (C1 + time threshold)
• Package C3 State
  ▪ All Cores Clock or Power Gated
  ▪ The Ring and Uncore are Clock Gated (MCLK gated (auto), VccP reduced (Deep))
• Package C6 State: The VccP is Off (Cores/Ring/Uncore Off)
• Memory States
  ▪ M1: Clock Gating
  ▪ M2: GDDR in Self Refresh
  ▪ M3: M2 + Shut off GMClk PLL
• SBOX States: L1 (PCI Express* Link States), SBOX Clock Gating


2.2 Intel® Xeon Phi™ Coprocessor Software Architecture


The software architecture for Intel® Xeon Phi™ coprocessor products accelerates highly parallel applications that take
advantage of hundreds of independent hardware threads and large local memory. Intel® Xeon Phi™ coprocessor
product software enables easy integration into system platforms that support the PCI Express* interconnect and run
either a Linux* or Windows* operating system.


2.2.1 Architectural Overview 


Intel® Xeon Phi™ coprocessor products are implemented as a tightly integrated, large collection of processor cores
(Intel® Many Integrated Core (MIC) Architecture) on a PCI Express* form-factor add-in card. As such, Intel® Xeon Phi™
coprocessor products comply as a PCI Express* endpoint, as described in the PCI Express* specification. Therefore, each
Intel® Xeon Phi™ coprocessor card implements the three required address spaces (configuration, memory, and I/O) and
responds to requests from the host to enumerate and configure the card. The host OS loads a device driver that must
conform to the OS driver architecture and behavior customary for the host operating system running on the platform
(e.g., interrupt handling, thread safe, security, ACPI power states, etc.).


From the software perspective, each Intel® Xeon Phi™ coprocessor add-in card represents a separate Symmetric
Multi-Processing (SMP) computing domain that is loosely-coupled to the computing domain represented by the OS running
on the host platform. Because the Intel® Xeon Phi™ coprocessor cards appear as local resources attached to PCI Express*,
it is possible to support several different programming models using the same hardware implementation. For example,
a programming model requiring shared memory can be implemented using SCIF messaging for communication. Highly
parallel applications utilize a range of programming models, so it is advantageous to offer flexibility in choosing a
programming model.


In order to support a wide range of tools and applications for High-Performance Computing (HPC), several Application
Programming Interfaces (APIs) are provided. The standard APIs provided are sockets over TCP/IP*, MPI, and OpenCL*.
Some Intel proprietary interfaces are also provided to create a suitable abstraction layer for internal tools and
applications. The SCIF APIs provide a common transport over which the other APIs communicate between host and
Intel® Xeon Phi™ coprocessor devices across the platform's PCI Express* hardware. Figure 2-13
illustrates the relationship of each of these APIs in the overall Intel® MIC Architecture Manycore Platform
Software Stack (MPSS).


As shown, the executable files and runtimes of a set of software development tools targeted at highly parallel
programming are layered on top of and utilize various subsets of the proprietary APIs.


Figure 2-13. Intel® Xeon Phi™ Coprocessor Software Architecture


The left side of the figure shows the host stack layered on a standard Linux* kernel. A similar configuration for an Intel®
Xeon Phi™ coprocessor card is illustrated on the right side of the figure; the Linux*-based kernel has some Intel® Xeon
Phi™ coprocessor specific modifications.


This depicts the normal runtime state of the system well after the platform's system BIOS has executed and caused the 
host OS to be loaded. [Note: The host platform's system BIOS is outside the scope of this document and will not be 
discussed further.] Each Intel® Xeon Phi™ coprocessor card's local firmware, referred to as the “Bootstrap”, runs after 
reset. The Bootstrap configures the card's hardware, and then waits for the host driver to signal what is to be done 
next. It is at this point that the coprocessor OS and the rest of the card's software stack are loaded, completing the
normal configuration of the software stack.


The software architecture is intended to directly support the application programming models described earlier. Support
for these models requires that the operating environment (coprocessor OS, flash, bootstrap) for a specific Intel® Xeon
Phi™ coprocessor be responsible for managing all of the memory and threads for that card. Although giving the host OS
such a responsibility may be appropriate for host based applications (a kind of forward acceleration model), the host OS
is not in a position to perform those services where work may be offloaded from any device to any other device.


Supporting the application programming models necessitates that communication between host and Intel® Xeon Phi™
coprocessor devices, after boot, be accomplished via the SCIF driver or a higher level API layered on SCIF. SCIF is designed
for very low latency, low overhead communication and provides independent communication streams between SCIF
clients.


A virtual network interface on top of the SCIF Ring 0 driver creates an IP-based network between the Intel® Xeon Phi™
coprocessor devices and the host. This network may be bridged to additional networks via host/user configuration.
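
Because the virtual interface presents ordinary IP networking, standard socket code on the host can be used to check
whether a service on a coprocessor is reachable. The following is a minimal sketch, not part of the MPSS tooling: the
function name can_connect is hypothetical, and the address and port of any given card depend on the host/user
configuration and are not specified in this guide.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the address and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

A call such as can_connect(card_address, 22) would then report whether the card's SSH port answers, where card_address
is whatever IP address the local configuration assigned to that card.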


Given this base architecture, developers and development environments are able to create usage models specific to 
their needs by adding user-installed drivers or abstractions on top of SCIF. For example, the MPI stack is layered on SCIF 
to implement the various MPI communications models (send/receive, one sided, two sided).


2.2.2 Intel® Manycore Platform Software Stack (MPSS)


Figure 2-14 outlines the high level pieces that comprise the Intel® Manycore Platform Software Stack, or MPSS.


[Figure 2-14: block diagram of the Intel® Xeon Phi™ board software stack. Layers, top to bottom: Tools & Applications
(Ganglia* monitor, IDB server, BLCR library with CR checkpoint and CR restore, user-mode SCIF); Linux kernel (syscall
interface, filesystem, VM, BLCR kernel module, MC exception handler, Linux NetDev, OFED core and HCA driver, SMC
responder, modified 8250 driver, SCIF driver, system monitor, I2C API, power management driver, optional user-installable
drivers such as the SEP sampler); hardware (PMU counters, memory controllers, SPI flash device, I2C bus, VRs, power and
temperature sensors, 7-segment display, InfiniBand* HCA and other host add-in cards); Bootstrap (fboot0; fboot1: GDDR
init, authentication, MP init, ELF loader; optional POST; maintenance mode).]

Figure 2-14. Intel® Xeon Phi™ Coprocessor Software Stack


2.2.3 Bootstrap


Since the Intel® Xeon Phi™ coprocessor cores are x86 Intel Architecture cores, the bootstrap resembles a System BIOS at
POST. The bootstrap runs when the board first gets power, but can also run when reset by the host due to a
catastrophic failure. The bootstrap is responsible for card initialization and booting the coprocessor OS.


The bootstrap consists of two separate blocks of code, called fboot0 and fboot1. The fboot0 block resides in ROM
memory on the die and cannot be upgraded, while fboot1 is upgradeable in the field and resides in the flash memory.


2.2.3.1 fboot0


When the card comes out of reset, the fboot0 code is executed first. This block of code is the root of trust because
it cannot be modified in the field. Its purpose is to authenticate the second stage, fboot1, and then pass the root of
trust to fboot1. If authentication fails, fboot0 removes power from the ring and cores, preventing any further action.
The only recovery mechanism from this state is to put the card into "zombie mode" by manually changing a jumper on the
card. Zombie mode allows the host to reprogram the flash chip, recovering from a bad fboot1 block.


The fboot0 execution flow is as follows:


1. Set up CAR mode to reduce execution time.
2. Transition to 64-bit protected mode.
3. Authenticate fboot1.
4. If authentication fails, shut down card.
5. If authentication passes, hand control to fboot1.


2.2.3.2 fboot1 


Fboot1 is responsible for configuring the card and booting the coprocessor OS. The card configuration involves 
initializing all of the cores, uncore units, and memory. This includes implementing any silicon workarounds since the 
hardware does not support microcode patching like a typical x86 core. The cores must be booted into 64-bit protected 
mode to be able to access the necessary configuration registers. 


When booting a 3rd party coprocessor OS, including the MPSS Linux*-based coprocessor OS, the root of trust is not
passed any further. The root of trust is only passed when booting into maintenance mode since privileged operations
are performed while in maintenance mode. Maintenance mode is where some locked registers are re-written for
hardware failure recovery.


Authentication determines which coprocessor OS type is booting (3rd party or maintenance). Fboot1 calls back into
fboot0 to run the authentication routine using the public key also embedded in fboot0. Only the maintenance
coprocessor OS is signed with a private key, and all other images must remain unsigned. If authentication passes, the
maintenance coprocessor OS boots. If authentication fails, the image is assumed to be a 3rd party coprocessor OS and
the Linux* boot protocol is followed, locking out access to sensitive registers to protect intellectual property.


The fboot1 execution flow is as follows: 


1. Set memory frequency then reset the card.
2. Perform core initialization.
3. Initialize GDDR5 memory.
   a. Use training parameters stored in the flash if memory has been trained earlier.
   b. If no training parameters are stored, or these parameters do not match the current configuration, perform the
      normal training routine and store training values in the flash.
4. Shadow fboot1 into GDDR5 to improve execution time.
5. Perform uncore initialization.
6. Perform CC6 register initialization.
7. Boot APs.
8. APs transition to 64-bit protected mode.
9. APs perform core initialization.
10. APs perform CC6 register initialization.
11. APs reach the end of the AP flow and wait for further instructions.
12. Wait for coprocessor OS download from host.
13. Authenticate coprocessor OS. All cores participate in authentication to minimize execution time.
14. If authentication passes, it is a maintenance coprocessor OS. Boot maintenance coprocessor OS.
15. If authentication fails, it is a 3rd party coprocessor OS (see Linux* loader section below).
   a. Lock out register access.
   b. Create boot parameter structure.
   c. Transition to 32-bit protected mode with paging disabled.
   d. Hand control to the coprocessor OS.
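
The authentication branch at the end of the fboot1 flow can be modeled as a small decision function. This is a simplified
sketch and not the bootstrap's code: a SHA-256 digest comparison stands in for the real public-key signature check, and
the names TRUSTED_DIGEST, classify_image, and boot_plan are hypothetical.

```python
import hashlib

# Stand-in for the signature check: digest of the one image treated as "signed".
# The real bootstrap verifies a private-key signature using a public key in fboot0.
TRUSTED_DIGEST = hashlib.sha256(b"maintenance-os-image").hexdigest()


def classify_image(image: bytes) -> str:
    """Return 'maintenance' if the image authenticates, else '3rd-party'."""
    digest = hashlib.sha256(image).hexdigest()
    return "maintenance" if digest == TRUSTED_DIGEST else "3rd-party"


def boot_plan(image: bytes) -> list:
    """List the remaining fboot1 steps for the downloaded image."""
    if classify_image(image) == "maintenance":
        return ["boot maintenance coprocessor OS"]
    # Authentication failure is not fatal; it selects the Linux* boot protocol path.
    return [
        "lock out register access",
        "create boot parameter structure",
        "transition to 32-bit protected mode with paging disabled",
        "hand control to the coprocessor OS",
    ]
```

The point of the sketch is that a failed check does not stop the boot; it only decides which of the two paths is taken.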


2.2.4 Linux* Loader 


The Intel® Xeon Phi™ coprocessor boots Linux*-based coprocessor OS images. It is capable of booting any 3rd party OS
developed for the Intel® Xeon Phi™ coprocessor. Previously, an untrusted coprocessor OS would result in a card
shutdown; however, the Intel® Xeon Phi™ coprocessor considers even the Intel developed Linux*-based coprocessor OS
to be untrusted. For this reason, it becomes simple to support 3rd party coprocessor OS images.


To boot a Linux* OS, the bootstrap has to conform to a certain configuration as documented in the Linux* kernel. There 
are 3 potential entry points into the kernel: 16-bit, 32-bit, and 64-bit entry points. Each entry point requires increasingly 
more data structures to be configured. The Intel® Xeon Phi™ coprocessor uses the 32-bit mode entry point. 


2.2.4.1 16-bit Entry Point 


The 16-bit entry point does not require any data structures to be created prior to entering the kernel; however, it
requires that there be support for system BIOS callbacks. The Intel® Xeon Phi™ coprocessor does not support this mode.


2.2.4.2 32-bit Entry Point 


The 32-bit entry point requires a boot parameter (or zero page) structure and a structure defining the number of cores
and other hardware (either an MP Table or a Simple Firmware Interface (SFI) table). The Linux* documentation in
boot.txt states "the CPU must be in 32-bit protected mode with paging disabled; a GDT must be loaded with the
descriptors for selectors __BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat segment;
__BOOT_CS must have execute/read permission, and __BOOT_DS must have read/write permission; CS must be
__BOOT_CS and DS, ES, SS must be __BOOT_DS; interrupt must be disabled; %esi must hold the base address of the
struct boot_params; %ebp, %edi and %ebx must be zero."


A field in the boot parameter structure (loadflags) tells the kernel whether it should use the segments set up by the
bootstrap or load new ones. If the kernel loads new ones, it uses the above settings. The bootstrap, however, does not
have the segment descriptors in the same order as required by this documentation, and therefore sets the boot parameter
flag to tell the kernel to continue using the segments already set up by the bootstrap. Everything about the bootstrap
descriptors matches the documentation except for the offset location in the GDT, so it is safe to continue using them.
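
The two details above, the GDT slot implied by a selector and the flag that tells the kernel to keep the existing
segments, can be sketched as follows. The loadflags offset (0x211) and the KEEP_SEGMENTS bit (bit 6, boot protocol 2.07
and later) are taken from the Linux* x86 boot protocol documentation rather than from this guide; the helper names and
the 4 KB buffer standing in for the zero page are illustrative assumptions.

```python
LOADFLAGS_OFFSET = 0x211   # loadflags byte within the zero page (Linux boot protocol)
KEEP_SEGMENTS = 1 << 6     # bit 6: do not reload segment registers at 32-bit entry


def gdt_index(selector: int) -> int:
    """Return the GDT slot a segment selector refers to (selector bits 3 and up)."""
    return selector >> 3


def keep_bootstrap_segments(zero_page: bytearray) -> None:
    """Set KEEP_SEGMENTS so the kernel keeps the segments already loaded."""
    zero_page[LOADFLAGS_OFFSET] |= KEEP_SEGMENTS


zero_page = bytearray(4096)          # empty stand-in for the boot parameter structure
keep_bootstrap_segments(zero_page)
print(gdt_index(0x10), gdt_index(0x18))   # slots that boot.txt expects for CS and DS
print(hex(zero_page[LOADFLAGS_OFFSET]))
```

Selectors 0x10 and 0x18 decode to GDT slots 2 and 3. A bootstrap whose descriptors sit in other slots has correct
segments but different selector values, which is why keeping the loaded segments is the simpler choice.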


The bootstrap also uses the SFI tables to report the number of cores, memory map, and other hardware configurations.
SFI is a relatively new format designed by Intel; the tables adhere to SFI version 0.7 (http://simplefirmware.org). SFI
support was initially added to the Linux* kernel in version 2.6.32. The Intel® Xeon Phi™ coprocessor supports booting a
Linux* kernel by using the 32-bit entry point.


2.2.4.3 64-bit Entry Point


The Intel® Xeon Phi™ coprocessor does not support this mode.


2.2.5 The Coprocessor Operating System (coprocessor OS) 


The Intel® Xeon Phi™ coprocessor OS establishes the basic execution foundation that the remaining elements of the
Intel® Xeon Phi™ coprocessor card’s software stack rest upon. The coprocessor OS is based on standard Linux* kernel
source code (from kernel.org) with as few changes to the standard kernel as possible. While some areas of
the kernel are designed, by the Linux* development community, to be tailored for specific architectures, this is not the
general case. Therefore, additional modifications to the kernel have been made to compensate for hardware normally
found on PC platforms, but missing from Intel® Xeon Phi™ coprocessor cards.


The coprocessor OS provides typical capabilities such as process/task creation, scheduling, and memory management. It
also provides configuration, power, and server management. Intel® Xeon Phi™ coprocessor-specific hardware is only
accessible through a device driver written for the coprocessor OS environment.


The Intel® Xeon Phi™ coprocessor Linux* kernel can be extended with loadable kernel modules (LKMs); LKMs may be
added or removed with modprobe. These modules may include both Intel supplied modules, such as the idb server and
SEP sampling collector, and end-user supplied modules.
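
Whether a given module is currently loaded can be checked from user space by reading /proc/modules, the standard Linux*
listing of loaded modules. The following is a hedged sketch; the function names are hypothetical and no module names
from the coprocessor OS are assumed.

```python
def parse_proc_modules(text: str) -> dict:
    """Map module name -> size in bytes from /proc/modules content."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            # Field 0 is the module name, field 1 is its memory size.
            modules[fields[0]] = int(fields[1])
    return modules


def loaded_modules(path: str = "/proc/modules") -> dict:
    """Read and parse the kernel's list of loaded modules."""
    with open(path) as handle:
        return parse_proc_modules(handle.read())
```

After a modprobe call, checking for the module's name in loaded_modules() confirms that the load succeeded.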

Figure 2-15. The Linux* Coprocessor OS Block Diagram


The Intel® Xeon Phi™ coprocessor Linux*-based coprocessor OS is a minimal, embedded Linux* environment ported to
the Intel® MIC Architecture with the Linux* Standard Base (LSB) Core libraries. It is also an unsigned OS. It implements
the Busybox* minimal shell environment. Table 2-11 lists the LSB components.


Table 2-11. LSB Core Libraries 


Component | Description
glibc | the GNU C standard library
libc | the C standard library
libm | the math library
libdl | programmatic interface to the dynamic linking loader
librt | POSIX real-time library (POSIX shared memory, clock and time functions, timers)
libcrypt | password and data encryption library
libutil | library of utility functions
libstdc++ | the GNU C++ standard library
libgcc_s | a low-level runtime library
libz | a lossless data compression library
libcurses | a terminal-independent method of updating character screens
libpam | the Pluggable Authentication Module (PAM) interfaces allow applications to request authentication via a system administrator defined mechanism


2.2.5.1 CPUID Enumeration 


CPUID enumeration can be obtained via the Linux* OS APIs that report information about the topology as listed in 
/sys/devices/system/cpu/cpu*/topology/*. 
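
A minimal sketch of reading those topology files from user space follows. It assumes only the sysfs layout named above;
the function name read_topology is hypothetical, and the set of attribute files present (for example core_id or
physical_package_id on standard Linux* kernels) depends on the kernel in use.

```python
import glob
import os


def read_topology(root: str = "/sys/devices/system/cpu") -> dict:
    """Return {cpu_name: {attribute: value}} from each cpu*/topology directory."""
    topology = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not entries such as cpufreq.
    for topo_dir in sorted(glob.glob(os.path.join(root, "cpu[0-9]*", "topology"))):
        cpu_name = os.path.basename(os.path.dirname(topo_dir))
        attributes = {}
        for entry in sorted(os.listdir(topo_dir)):
            path = os.path.join(topo_dir, entry)
            if os.path.isfile(path):
                with open(path) as handle:
                    attributes[entry] = handle.read().strip()
        topology[cpu_name] = attributes
    return topology
```

The root argument exists so the function can be pointed at a copy of the tree for testing.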


2.2.6 Symmetric Communication Interface (SCIF) 


SCIF is the communication backbone between the host processors and the Intel® Xeon Phi™ coprocessors in a
heterogeneous computing environment. It provides communication capabilities within a single platform. SCIF enables
communications between host and Intel® Xeon Phi™ coprocessor cards, and between Intel® Xeon Phi™ coprocessor
cards within the platform. It provides a uniform API for communicating across the platform's PCI Express* system busses
while delivering the full capabilities of the PCI Express* transport hardware. SCIF directly exposes the DMA capabilities
of the Intel® Xeon Phi™ coprocessor for high bandwidth transfer of large data segments, as well as the ability to map
memory of the host or an Intel® Xeon Phi™ coprocessor device into the address space of a process running on the host
or on any Intel® Xeon Phi™ coprocessor device.


Communication between SCIF node pairs is based on direct peer-to-peer access of the physical memory of the peer 
node. In particular, SCIF communication is not reflected through system memory when both nodes are Intel® Xeon Phi™ 
coprocessor cards. 


SCIF's messaging layer takes advantage of the inherent reliability of PCI Express* and operates as a simple data-only
network without the need for any intermediate packet inspection. Messages are not numbered, nor is error checking
performed. Due to the data-only nature of the interface, it is not a direct replacement for higher level communication
APIs, but rather provides a level of abstraction from the system hardware for these other APIs. Each API that wishes to
take advantage of SCIF will need to adapt to this new proprietary interface directly or through the use of a shim layer.


A more detailed description of the SCIF API can be found in Section 5.3. 


2.2.7 Host Driver 


The host driver is a collection of host-side drivers and servers including SCIF, power management, and RAS and server
management. The primary job of the host driver is to initialize the Intel® Xeon Phi™ coprocessor card(s); this includes
loading the coprocessor OS and its required boot parameters for each of the cards. Following successful booting, the
primary responsibility of the host driver is to serve as the root of the SCIF network. Additional responsibilities revolve
around serving as the host-side interface for power management, device management, and configuration. However, the
host driver does not directly support any type of user interface or remote process API. These are implemented by other
user-level programs or by communication protocols built on top of the driver or SCIF (e.g. Sockets, MPI, etc.).


DMA transfers are asynchronous operations. Host-initiated DMA is expected to have lower latency than proxy
DMA from the card. Applications have the option to pick between memory copy and DMA, or to let the driver choose
the best method. Memory copy is optimized to be multithreaded, using multiple cores to parallelize the operation up to
the limit of the PCI Express* bandwidth. When there is a need to lower the host CPU load, or when the
transfer size is above a threshold, DMA is the preferred method.
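
The selection rule described above can be written as a short sketch. It illustrates the stated rule of thumb only and is
not the driver's actual policy: the function name choose_transfer is hypothetical, and the 64 KB cutoff is an assumed
placeholder because this guide does not give the threshold value.

```python
DMA_THRESHOLD_BYTES = 64 * 1024  # hypothetical cutoff; the guide gives no value


def choose_transfer(size_bytes: int, reduce_host_cpu_load: bool = False,
                    threshold: int = DMA_THRESHOLD_BYTES) -> str:
    """Pick 'dma' or 'memcpy' following the rule of thumb in the text."""
    if reduce_host_cpu_load or size_bytes > threshold:
        return "dma"
    return "memcpy"
```

Small transfers default to memory copy, while large transfers, or any transfer made when host CPU load matters, go to DMA.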


Interrupts based on MSI/x (Message Signaled Interrupts) are supported by the host driver with these benefits:


• Eliminates the dedicated hardware interrupt line connection.
• No interrupt sharing with other device(s).
• With optimized hardware design, there is no need for the interrupt routine to read back from hardware, which improves
  the efficiency of interrupt handling.
• The device can target different CPU cores when triggering, thus making full use of multiple cores for interrupt
  handling.


Figure 2-16. Intel® Xeon Phi™ Coprocessor Host Driver Software Architecture Components


2.2.7.1 Intel® Xeon Phi™ Coprocessor SMC Control Panel


The SMC Control Panel (micsmc), located in /opt/intel/mic/bin after installing Intel® MPSS, is the local host-side user
interface for system management. The Control Panel is more practical for smaller setups like a workstation environment
rather than for a large-scale cluster deployment. The Control Panel is mainly responsible for:


• Monitoring Intel® Xeon Phi™ coprocessor card status, parameters, power, thermal, etc.
• Monitoring system performance, core usage, memory usage, process information
• Monitoring overall system health, critical errors, or events
• Hardware configuration and setting, ECC, turbo mode, power plan setting, etc.


Control Panel applications rely on the MicAccessSDK to access card parameters. The MicAccessSDK exposes a set of APIs
enabling applications to access the Intel® MIC Architecture hardware. The Ring 3 system management agent running on
the card handles the queries from the host and returns results to the host through the SCIF interface.


The host RAS agent captures the MCA error report from the card and takes proper action for different error categories. 
The host RAS agent determines the error exposed to the end-user based on the error filter and Maintenance mode 
test/repair result. Then the error/failure is shown to end users on the Control Panel. 

Figure 2-17. Control Panel Software Architecture


2.2.7.2 Ganglia* Support 


Ganglia* is a scalable, distributed monitoring system for high-performance computing systems such as clusters and 
grids. The implementation of Ganglia* is robust, has been ported to an extensive set of operating systems and processor 
architectures, and is currently in use on thousands of clusters around the world. 


Briefly, the Ganglia* system has a daemon running on each computing node or machine. The data from these daemons 
is collected by another daemon and placed in an rrdtool database. Ganglia* then uses PHP scripts on a web server to 
generate graphs as directed by the user. The typical Ganglia* data flow is illustrated in Figure 2-18. 

Figure 2-18. Ganglia* Monitoring System Data Flow Diagram


The cluster level deployment of Ganglia* is illustrated in Figure 2-19.

Figure 2-19. Ganglia* Monitoring System for a Cluster


For integration with system management and monitoring systems like Ganglia*, the Manycore Platform Software Stack:


• Provides an interface for the Ganglia* monitoring agent to collect monitoring state or data: the sysfs or /proc virtual
  file system exposed by the Linux*-based coprocessor OS on each Intel® Xeon Phi™ coprocessor device.
• Provides a plug-in for custom made metrics about the nodes (that is, Intel® Xeon Phi™ coprocessor cards) that are
  being monitored by Ganglia*.
• Serves as a reference implementation for the whole Ganglia* monitoring environment setup.


In the Ganglia* reference implementation shown in Figure 2-20, each Intel® Xeon Phi™ coprocessor card can be treated
as an independent computing node. Because the Intel® Xeon Phi™ coprocessor runs a Linux*-based OS on the card,
the gmond monitoring agent can run on the card as-is. Gmond supports configuration files and plug-ins, so it is easy to
add customized metrics.


For a workstation configuration or for a remote server in a cluster environment, gmetad can be run on the host. For
gmetad, no customization is needed. All front-end tools, such as rrdtool and the scripts, should use the standard
Ganglia* configuration.

Figure 2-20. Intel® Xeon Phi™ Coprocessor Ganglia* Support Diagram


All of the daemons in Ganglia* talk to each other over TCP/IP. Intel® Xeon Phi™ coprocessor devices are accessible via a 
TCP/IP subnet off the host, in which the IP component is layered on SCIF. 


By default, Ganglia* collects the following metrics:


• cpu_num
• cpu_speed
• mem_total
• swap_total
• boottime
• machine_type
• os_name
• os_release
• location
• gexec
• cpu_user
• cpu_system
• cpu_idle
• cpu_nice
• cpu_aidle
• cpu_wio
• cpu_intr
• cpu_sintr
• load_one
• load_five
• load_fifteen
• proc_run
• proc_total
• mem_free
• mem_shared
• mem_buffers
• mem_cached
• swap_free
• bytes_out
• bytes_in
• pkts_in
• pkts_out
• disk_total
• disk_free
• part_max_used


In addition to these default metrics, the following metrics can be collected on the Intel® Xeon Phi™ coprocessor:


• Intel® Xeon Phi™ coprocessor device utilization
• Memory utilization
• Core utilization
• Die temperature
• Board temperature
• Core frequency
• Memory frequency
• Core voltage
• Memory voltage
• Power consumption
• Fan speed
• Active core number (CPU number is standard)


To collect additional metrics follow these steps:


1. Write a script or C/C++ program which retrieves the information. The script can be written in any scripting
   language. Python is used to retrieve default metrics. In case of a C/C++ program, the .so files are needed.
2. Register the program with the Ganglia* daemon (gmond) by issuing the Ganglia* command gmetric.
3. Make the registration persistent by adding the modification to the configuration file: /etc/ganglia/gmond.conf.
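
Steps 1 and 2 can be sketched as a small Python script that publishes one value through gmetric. This is an illustrative
sketch, not the reference implementation: the function names and the metric name used in the usage note are
hypothetical, while the --name, --value, --type, and --units options are those of the standard gmetric command.

```python
import subprocess


def gmetric_command(name, value, value_type="float", units=""):
    """Build the argv for one gmetric invocation (nothing is executed here)."""
    command = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        command += ["--units", units]
    return command


def publish(name, value, value_type="float", units=""):
    """Run gmetric; requires Ganglia* to be installed and gmond to be reachable."""
    subprocess.run(gmetric_command(name, value, value_type, units), check=True)
```

For example, publish("mic_die_temp", 55.5, units="C") would report a die temperature under the assumed metric name
mic_die_temp; how the value itself is obtained is left to the script from step 1.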


2.2.7.3 Intel® Manycore Platform Software Stack (MPSS) Service 


The Linux* mechanism for controlling system services is used to boot and shut down Intel® Xeon Phi™ coprocessor
cards. This service will start (load) and stop (unload) the MPSS to and from the card (e.g. "service mpss start/stop"). This
replaces the micstart command utility described in the next section. Please see the README file included in the MPSS tar
packages for instructions on how to use this service.


2.2.7.4 Intel® MIC Architecture Commands


This section provides a short summary of available Intel® MIC Architecture commands. More detailed information on
each command can be obtained by issuing the '--help' option with each command.


Table 2-12. Intel® MIC Architecture Commands


Command | Description
micflash | A command utility normally used to update the Intel® Xeon Phi™ coprocessor PCI Express* card on-board flash. It can also be used to list the various device characteristics.
micinfo | Displays the physical settings and parameters of the card including the driver versions.
micsmc | The Control Panel that displays the card thermal, electrical, and usage parameters. Examples include Core Temperature, Core Usage, Memory Usage, etc. An API for this utility is also available to OEMs under the MicAccess SDK as mentioned previously in the section on the Control Panel.
miccheck | A utility that performs a set of basic checks to confirm that MPSS is correctly installed and that all communications links between the host and coprocessor(s), and between coprocessors, are functional.


2.2.8 Sysfs Nodes 


Sysfs is a Linux* 2.6 virtual file system. It exports information about devices and drivers from the kernel device model to
user space, and is similar to the sysctl mechanism found in BSD systems, albeit implemented as a file system. As such,
some Intel® Xeon Phi™ coprocessor device characteristics can be obtained from sysfs. Characteristics such as core/cpu
utilization, process/thread details and system memory usage are better presented from standard /proc interfaces. The
purpose of these sysfs nodes is to present information not otherwise available. The organization of the file system
directory hierarchy is strict and is based on the internal organization of kernel data structures.


Sysfs is a mechanism for representing kernel objects, their attributes, and their relationships with each other. It provides 
two components: a kernel programming interface for exporting these items via sysfs, and a user interface to view and 
manipulate these items that maps back to the kernel objects they represent. Table 2-13 shows the mapping between 
internal (kernel) constructs and their external (user space) Sysfs mappings. 


Table 2-13. Kernel to User Space Mappings 


Internal (kernel) | External (user space)
Kernel Objects | Directories
Object Attributes | Regular Files
Object Relationships | Symbolic Links
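
The mapping can be applied in reverse to tell what kind of kernel construct a given sysfs path represents. The sketch
below is illustrative; the function name classify_entry is hypothetical, and it relies on the standard sysfs convention
that kernel objects themselves appear as directories.

```python
import os


def classify_entry(path: str) -> str:
    """Name the kernel construct that a sysfs path represents."""
    if os.path.islink(path):      # check links first: a link may point at a directory
        return "object relationship"
    if os.path.isdir(path):
        return "kernel object"
    if os.path.isfile(path):
        return "object attribute"
    return "unknown"
```

Symbolic links are tested before directories because a link to a directory would otherwise be reported as a kernel object.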


The currently enabled sysfs nodes are listed in Table 2-14.


Table 2-14. SYSFS Nodes


Node | Description
clst | Number of known cores
fan | Fan state
freq | Core frequencies
gddr | GDDR device info
gfreq | GDDR frequency
gvolt | GDDR voltage
hwinf | Hardware info (revision, stepping, ...)
temp | Temperature sensor readings
vers | Version string
volt | Core voltage
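
Reading these nodes from user space amounts to reading small text files. The sketch below assumes the nodes appear as
files named after the table entries under a single directory; that directory is not specified in this section, so the
base_dir argument is a placeholder the caller must supply, and the function name read_nodes is hypothetical.

```python
import os

# Node names as listed in Table 2-14.
NODES = ("clst", "fan", "freq", "gddr", "gfreq", "gvolt", "hwinf", "temp", "vers", "volt")


def read_nodes(base_dir: str, nodes=NODES) -> dict:
    """Return {node: text} for each readable node file under base_dir."""
    values = {}
    for node in nodes:
        path = os.path.join(base_dir, node)
        try:
            with open(path) as handle:
                values[node] = handle.read().strip()
        except OSError:
            # Node absent or unreadable on this system; skip it.
            continue
    return values
```

Nodes that are missing are skipped rather than treated as errors, since the set of enabled nodes may change.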


Figure 2-21. MPSS Ganglia* Support


Sysfs is a core piece of the kernel infrastructure that provides a relatively simple interface to perform a simple task.
Some popular system monitoring software, like Ganglia*, uses /proc or the sysfs interface to fetch system status
information. Since the Intel® Xeon Phi™ coprocessor can expose card information through sysfs, a single interface can be
maintained for both local and server management.


2.2.9 Intel® Xeon Phi™ Coprocessor Software Stack for MPI Applications


This section covers the architecture of the Intel® Xeon Phi™ coprocessor software stack components that enable uDAPL
and IB verbs support for MPI. Given the significant role of MPI in high-performance computing, the Intel® Xeon Phi™
coprocessor has built-in support for OFED* (OpenFabrics Enterprise Distribution), which is widely used in high performance
computing for applications that require high efficiency computing, wire-speed messaging, and microsecond latencies.
OFED* is also the preferred communications stack for the Intel® MPI Library, allowing Intel® MIC Architecture to take
advantage of the remote direct memory access (RDMA) capable transports that it exposes. The Intel® MPI Library for Intel®
MIC Architecture on OFED* can use SCIF or a physical InfiniBand* HCA (Host Channel Adapter) for communications
between Intel® Xeon Phi™ coprocessor devices and between an Intel® Xeon Phi™ coprocessor and the host; in this way,
Intel® Xeon Phi™ coprocessor devices are treated as stand-alone nodes in an MPI network.


There are two implementations that cover internode and intranode communications through the InfiniBand* HCA: 


• CCL (Coprocessor Communication Link). A proxy driver that allows access to a hardware InfiniBand* HCA from the
  Intel® Xeon Phi™ coprocessor.
• OFED*/SCIF. A software-based InfiniBand*-like device that allows communication within the box.


This guide only covers the first level decomposition of the software into its major components and describes how these 
components are used. This information is based on the OpenFabrics Alliance* (OFA*) development effort. Because 
open source code is constantly changing and evolving, developers are responsible for monitoring the OpenFabrics 
Alliance* to ensure compatibility. 


2.2.9.1 Coprocessor Communication Link (CCL) 


To efficiently communicate with remote systems, applications running on Intel® Many Integrated Core Architecture 
(Intel® MIC Architecture) coprocessors require direct access to RDMA devices in the host platform. This section 
describes an architecture providing this capability (called CCL) that is targeted for internode communication. 


In a heterogeneous computing environment, it is desirable to have efficient communication mechanisms from all 
processors, whether they are the host system CPUs or Intel® Xeon Phi™ coprocessor cores. Providing a common, 
standards-based, programming and communication model, especially for clustered system applications is an important 
goal of the Intel® Xeon Phi™ coprocessor software. A consistent model not only simplifies development and 
maintenance of applications, but allows greater flexibility for using a system to take full advantage of its performance. 


RDMA architectures such as InfiniBand* have been highly successful in improving performance of HPC cluster 
applications by reducing latency and increasing the bandwidth of message passing operations. RDMA architectures 
improve performance by moving the network interface closer to the application, allowing kernel bypass, direct data 
placement, and greater control of I/O operations to match application requirements. RDMA architectures allow process 
isolation, protection, and address translation to be implemented in hardware. These features are well-suited to the 
Intel® Xeon Phi™ coprocessor environment where host and coprocessor applications execute in separate address
domains. 


CCL brings the benefits of RDMA architecture to the Intel® Xeon Phi™ coprocessor. In contrast, without CCL,
communications into and out of attached processors must incur an additional data copy into host memory, substantially
impacting both message latency and achievable bandwidth. Figure 2-22 illustrates the operation of an RDMA transfer
with CCL and an Intel® Xeon Phi™ coprocessor add-in PCI Express* card.

Figure 2-22. RDMA Transfer with CCL


CCL allows RDMA device hardware to be shared between Linux*-based host and Intel® Xeon Phi™ coprocessor
applications. Figure 2-23 illustrates an MPI application using CCL.

Figure 2-23. MPI Application on CCL


Figure 2-23 highlights the primary software modules (bolded rounded components) responsible for CCL. The host
system contains a PCI Express* interface with one or more RDMA devices and one or more Intel® Xeon Phi™ coprocessor
add-in cards. Software modules on the host and Intel® Xeon Phi™ coprocessor communicate with each other and access
RDMA devices across the PCI Express* bus. The software uses a split-driver model to proxy operations across PCI
Express* to manage RDMA device resources allocated by the Vendor Driver on the host. These modules include the IB*
Proxy Daemon, the IB* Proxy Server, the IB* Proxy Client, the Vendor Proxy Drivers, and SCIF.


RDMA operations are performed by a programming interface known as verbs. Verbs are categorized into privileged and
non-privileged classes. Privileged verbs are used to allocate and manage RDMA resources. Once these resources have
been initialized, non-privileged verbs are used to perform I/O operations. I/O operations can be executed directly to
and from user-mode applications on the Intel® Xeon Phi™ coprocessor concurrently with host I/O operations, with
kernel-mode bypass, and with direct data placement. The RDMA device provides process isolation and performs
address translation needed for I/O operations. CCL proxies privileged verb operations between host and Intel® Xeon
Phi™ coprocessor systems such that each Intel® Xeon Phi™ coprocessor PCI Express* card appears as if it were another
"user-mode" process above the host IB* core stack.

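The split between privileged and non-privileged verbs described above can be pictured as a routing decision made for
each verb issued on the coprocessor. The sketch below is a conceptual illustration only and does not reflect the CCL
implementation: the function name route_verb is hypothetical, and the libibverbs function names are used merely as
familiar examples of resource-management versus I/O operations, not as a list taken from this guide.

```python
# Illustrative grouping only; these libibverbs names are examples, not a list from this guide.
PRIVILEGED_VERBS = {"ibv_alloc_pd", "ibv_reg_mr", "ibv_create_cq", "ibv_create_qp"}
NON_PRIVILEGED_VERBS = {"ibv_post_send", "ibv_post_recv", "ibv_poll_cq"}


def route_verb(verb: str) -> str:
    """Return where a verb issued on the coprocessor is handled under CCL."""
    if verb in PRIVILEGED_VERBS:
        # Resource allocation and management is proxied to the host.
        return "proxied to host"
    if verb in NON_PRIVILEGED_VERBS:
        # I/O operations bypass the kernel and go straight to the RDMA device.
        return "direct to RDMA device"
    raise ValueError(f"unclassified verb: {verb}")
```

Only the setup path pays the cost of crossing PCI Express* to the host; the I/O path stays direct.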

2.2.9.1.1 IB* Core Modifications


The IB* core module defines the kernel-mode verbs interface layer and various support functions. Support functions
that allow vendor drivers to access user-mode data are:


• ib_copy_to_udata()
• ib_copy_from_udata()
• ib_umem_get()
• ib_umem_page_count()
• ib_umem_release()


These functions may be used by vendor drivers for privileged verb operations. Since the default implementation of these
functions assumes that data is always in host system user-space, the IB* core was modified to allow these functions to be
redirected for CCL. The IB* Proxy Server overrides the default implementation of these functions to transfer data to or
from the Intel® Xeon Phi™ coprocessor as needed. For the redirection to be effective, vendor drivers must use the support
functions provided by IB* core.
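
The redirection can be illustrated with a table of replaceable function references. This is a conceptual sketch and not
the kernel code: every name in it (SupportOps, default_copy_to_user, proxy_copy_to_coprocessor) is hypothetical, and a
local buffer copy stands in for the actual transfer to coprocessor memory.

```python
def default_copy_to_user(user_buffer: bytearray, data: bytes) -> str:
    """Default path: the destination is host user-space memory."""
    user_buffer[:len(data)] = data
    return "host"


def proxy_copy_to_coprocessor(user_buffer: bytearray, data: bytes) -> str:
    """Override: stand-in for transferring the data to coprocessor memory."""
    user_buffer[:len(data)] = data
    return "coprocessor"


class SupportOps:
    """Table of support functions that a proxy may override."""

    def __init__(self):
        self.copy_to_udata = default_copy_to_user


ops = SupportOps()
ops.copy_to_udata = proxy_copy_to_coprocessor   # what the proxy server would install
```

A driver that calls ops.copy_to_udata() picks up the override automatically, whereas a driver that calls
default_copy_to_user() directly never sees it, which is why drivers that bypass the support functions need modification.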


2.2.9.1.2 Vendor Driver Requirements 

The IB* core module provides support functions that allow Vendor Drivers to access user-mode data. Instead of using
the IB* core support functions, however, some Vendor Driver implementations call user-mode access routines directly.
Table 2-15 lists drivers that require modification to work with CCL. Currently, only the Mellanox HCAs are supported.


Table 2-15: Vendor Drivers Bypassing IB* Core for User-Mode Access


|                | amso1100* | cxgb3* | cxgb4* | ehca* | ipath* | mlx4* | mthca* | nes* | qib*
copy_to_user    | Jo | | [X | j| — | [X
copy_from_user  | | o | PP 18
get_user_pages  | LL pf po | IX:


Beyond utilizing the IB* core interface support functions, there are additional requirements for enabling Vendor Drivers 
to take full advantage of CCL. Table 2-16 shows that RDMA is divided into two distinct architectures, InfiniBand* and 
iWARP*. 


The underlying process for establishing a connection differs greatly between InfiniBand* and iWARP* architectures.
Although InfiniBand* architecture defines a connection management protocol, it is possible to exchange information
out-of-band and directly modify a queue pair to the connected state. uDAPL implements a socket CM (SCM) protocol
that utilizes this technique and only requires user-mode verbs access through CCL. For iWARP* architecture, however,
establishing a connection requires the rdma_cm kernel module to invoke special iWARP* CM verbs. Therefore, to support
iWARP* devices, CCL must proxy rdma_cm calls between the host and the Intel® Xeon Phi™ coprocessor.


As shown in Table 2-16, the IBM* eHCA* device is not supported on x86 architecture; it requires a PowerPC* system
architecture, which is not supported by Intel® Xeon Phi™ coprocessor products.


QLogic* provides ipath* and qib* drivers, which are hybrid hardware/software implementations of InfiniBand* that in
some cases use memcpy() to transfer data and that do not provide full kernel bypass.


Table 2-16: Summary of Vendor Driver Characteristics


Driver   Vendor                    RDMA Type     x86 Support   Kernel Bypass
cxgb3*   Chelsio Communications*   iWARP*        yes           yes
cxgb4*   Chelsio Communications*   iWARP*        yes           yes
ehca*    IBM Corporation*          InfiniBand*   no            yes
ipath*   QLogic*                   InfiniBand*   yes           no
mlx4*    Mellanox Technologies*    InfiniBand*   yes           yes
mthca*   Mellanox Technologies*    InfiniBand*   yes           yes
nes*     Intel Corporation         iWARP*        yes           yes
qib*     QLogic*                   InfiniBand*   yes           no


2.2.9.1.3 IB* Proxy Daemon 


The IB* Proxy Daemon is a host user-mode application. It provides a user-mode process context for IB* Proxy Server 
calls (through the IB* core) to the underlying vendor drivers. A user-mode process context is needed to perform 
memory mappings without modifying the existing vendor drivers. Vendor drivers typically map RDMA device MMIO 
memory into the calling user-mode process virtual address space with ioremap(), which requires a valid user-mode 
current->mm structure pointer. 


An instance of the IB* Proxy Daemon is started via a udev "run" rule for each Intel® Xeon Phi™ coprocessor device added
by the IB* Proxy Server. The IB* Proxy Daemon is straightforward. It immediately forks to avoid blocking the udev
device manager thread. The parent process exits while the child examines the action type for device add notifications;
all other notifications are ignored and the daemon simply exits. If a device add notification is received, the device is
opened, followed by a zero-byte write. It is this call to write that provides the user-mode process context used by the
IB* Proxy Server. When the IB* Proxy Server relinquishes the thread, the write completes, and the IB* Proxy Daemon
closes the device and exits.
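
The daemon flow described above can be sketched in a few lines. This is an illustrative sketch only, not the shipped daemon: the function names are hypothetical, and it assumes that udev passes the event type and device node to the "run" program through the ACTION and DEVNAME environment variables.

```python
import os


def handle_udev_event(action: str, device_path: str) -> bool:
    """Child-side logic: act only on device "add" notifications.

    Returns True if the device was opened and the zero-byte write was issued.
    """
    if action != "add":
        return False  # all other notifications are ignored
    fd = os.open(device_path, os.O_WRONLY)
    try:
        # In the real daemon this write blocks inside the IB* Proxy Server,
        # lending it a user-mode process context until the server returns.
        os.write(fd, b"")
    finally:
        os.close(fd)
    return True


def main() -> None:
    # Fork immediately so the udev device manager thread is not blocked.
    if os.fork() != 0:
        os._exit(0)  # parent exits right away
    # ASSUMPTION: udev supplies ACTION and DEVNAME in the environment.
    action = os.environ.get("ACTION", "")
    device = os.environ.get("DEVNAME", "")
    handle_udev_event(action, device)
    os._exit(0)
```

On an ordinary file the zero-byte write returns at once; the blocking behavior comes from the proxy device's write handler in the kernel module.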


2.2.9.1.4 IB* Proxy Server 


The IB* Proxy Server is a host kernel module. It provides communication and command services for Intel® Xeon Phi™
coprocessor IB* Proxy Clients. The IB* Proxy Server listens for client connections and relays RDMA device add, remove,
and event notification messages. The IB* Proxy Server initiates kernel-mode IB* verbs calls to the host IB* core layer on
behalf of Intel® Xeon Phi™ coprocessor IB* Proxy Clients and returns their results.


Upon initialization, the IB* Proxy Server registers with the host IB* core for RDMA device add and remove callbacks, and
creates a kernel thread that listens for Intel® Xeon Phi™ coprocessor connections through SCIF. The IB* Proxy Server
maintains a list of data structures for each side of its interface. One list maintains RDMA device information from IB*
core add and remove callbacks, while another list maintains connections to IB* Proxy Clients running on the Intel® Xeon
Phi™ coprocessor. Together these lists preserve the state of the system so that RDMA device add and remove messages
are forwarded to IB* Proxy Clients.


When an IB* Proxy Client connection is established through SCIF, the IB* Proxy Server creates a device that represents
the interface. The device exists until the SCIF connection is lost or is destroyed by unloading the driver. The Linux*
device manager generates udev events for the device to launch the IB* Proxy Daemon. The IB* Proxy Server uses the IB*
Proxy Daemon device write thread to send add messages for existing RDMA devices to the IB* Proxy Client, and enters a
loop to receive and process client messages. Any RDMA device add or remove notifications that occur after the IB*
Proxy Client SCIF connections are established are sent from the IB* core callback thread. In addition, the IB* Proxy
Server forwards asynchronous event and completion queue notification messages from IB* core to IB* Proxy Clients.
These messages are also sent from the IB* core callback thread.
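
The two lists kept by the IB* Proxy Server, and the way they combine so that every client sees every device, can be illustrated with a small model. This is a sketch under stated assumptions: the class and method names are hypothetical, and each client connection is modeled as a plain Python list acting as its outgoing message queue rather than a SCIF endpoint.

```python
class ProxyServerState:
    """Toy model of the server's device list and client list."""

    def __init__(self):
        self.devices = []  # RDMA devices reported by IB* core callbacks
        self.clients = []  # one outgoing message list per connected client

    def on_client_connected(self, outbox):
        # A new client first learns about every device that already exists.
        self.clients.append(outbox)
        for dev in self.devices:
            outbox.append(("add", dev))

    def on_client_disconnected(self, outbox):
        # Compare by identity so two empty outboxes are not confused.
        self.clients = [c for c in self.clients if c is not outbox]

    def on_device_added(self, dev):
        # Later additions are forwarded to all connected clients.
        self.devices.append(dev)
        for outbox in self.clients:
            outbox.append(("add", dev))

    def on_device_removed(self, dev):
        self.devices.remove(dev)
        for outbox in self.clients:
            outbox.append(("remove", dev))
```

Keeping both lists means the order of events does not matter: a client that connects late still receives add messages for devices that were registered before it arrived.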


The IB* Proxy Server performs verbs on behalf of IB* Proxy Clients. Received messages are dispatched to an appropriate
verb handler where they are processed to generate a verb response message. Verbs are synchronous calls directed to
specific Vendor Drivers through the IB* core interface. The IB* Proxy Server performs pre- and post-processing
operations as required for each verb, and maintains the state required to tear down resources should a SCIF connection
abruptly terminate. Privileged verbs provide access to user-mode data to Vendor Drivers through IB* core support
functions. The IB* Proxy Server overrides the default implementation of these functions to transfer data to or from
Intel® Xeon Phi™ coprocessors as needed.
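
A minimal sketch of the dispatch-and-track pattern follows. It is not the module's actual code: the message layout, the handler names, and the handle numbering are assumptions made for illustration, and the handlers stand in for calls that would go to a Vendor Driver through IB* core.

```python
class VerbDispatcher:
    """Route verb messages to handlers and remember what each client owns."""

    def __init__(self):
        self.handlers = {}
        self.resources = {}  # connection id -> [(kind, handle), ...] in creation order
        self._next_handle = 1

    def register(self, verb, handler):
        self.handlers[verb] = handler

    def track(self, conn_id, kind):
        handle = self._next_handle
        self._next_handle += 1
        self.resources.setdefault(conn_id, []).append((kind, handle))
        return handle

    def untrack(self, conn_id, handle):
        owned = self.resources.get(conn_id, [])
        self.resources[conn_id] = [r for r in owned if r[1] != handle]

    def dispatch(self, conn_id, message):
        handler = self.handlers.get(message["verb"])
        if handler is None:
            return {"status": "unsupported"}
        return handler(self, conn_id, message)

    def connection_lost(self, conn_id):
        # Release leftovers in reverse creation order (dependents first).
        return list(reversed(self.resources.pop(conn_id, [])))


def alloc_pd(dispatcher, conn_id, message):
    # Post-processing step: record the new resource for later teardown.
    return {"status": "ok", "handle": dispatcher.track(conn_id, "pd")}


def dealloc_pd(dispatcher, conn_id, message):
    dispatcher.untrack(conn_id, message["handle"])
    return {"status": "ok"}
```

The per-connection resource list is what allows cleanup when a SCIF connection terminates without the client having released its resources.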


2.2.9.1.5 IB* Proxy Client 


The IB* Proxy Client is an Intel® Xeon Phi™ coprocessor kernel module. The IB* Proxy Client provides a programming
interface to vendor proxy drivers to perform IB* verbs calls on the host. The interface abstracts the details of formatting
commands and performing the communication. The IB* Proxy Client invokes callbacks for device add, remove, and
event notifications to registered Intel® Xeon Phi™ coprocessor Vendor Proxy Drivers.


Upon initialization, the IB* Proxy Client creates a kernel thread to establish a connection to the IB* Proxy Server through
SCIF. The IB* Proxy Client maintains a list of data structures for each side of its interface. One list maintains RDMA
device information received from IB* Proxy Server add and remove messages, while another list maintains Vendor Proxy
Drivers that have registered with the IB* Proxy Client. Together, these lists preserve the state of the system so that
RDMA device add and remove callbacks are forwarded to Vendor Proxy Drivers as required.


When a connection to the IB* Proxy Server is established through SCIF, the IB* Proxy Client enters a loop to receive and
process server messages. With the exception of verb response messages, all device add, remove, asynchronous event,
and completion queue notification messages are queued for processing on a Linux work queue. Processing these
messages on a separate thread is required to avoid a potential communication deadlock with the receive thread. Device
add and remove message callbacks are matched to registered Vendor Proxy Drivers using PCI vendor and device ID
information. Asynchronous event and completion queue notifications are dispatched to callback handlers provided
upon resource creation or to the Intel® Xeon Phi™ coprocessor IB* core layer.


The IB* Proxy Client provides a verbs command interface for use by Vendor Proxy Drivers. This interface is modeled 
after the IB* Verbs Library command interface provided for user-mode Vendor Libraries. A Vendor Proxy Driver uses 
this interface to perform IB* verbs calls to the Vendor Driver on the host. The interface abstracts the details of 
formatting commands and performing the communication through SCIF. Verbs are synchronous calls; the calling thread 
will block until the corresponding verb response message is received to complete the operation. 
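
The interplay between the synchronous verb interface and the receive loop can be sketched as follows. This is a user-space model under assumptions, not the kernel module: the class name, message fields, and the send callable are hypothetical, and a Python queue stands in for the Linux work queue.

```python
import queue
import threading


class ProxyClient:
    def __init__(self, send):
        self._send = send                # callable that transmits a request
        self._pending = {}               # request id -> [Event, response]
        self._lock = threading.Lock()
        self._next_id = 0
        self.work_queue = queue.Queue()  # notifications handled off the receive thread

    def call_verb(self, verb, **args):
        """Synchronous verb: block until the matching response arrives."""
        with self._lock:
            self._next_id += 1
            req_id = self._next_id
            slot = [threading.Event(), None]
            self._pending[req_id] = slot
        self._send({"id": req_id, "verb": verb, "args": args})
        slot[0].wait()
        return slot[1]

    def on_message(self, msg):
        """Called by the receive loop for every message from the server."""
        if msg["type"] == "verb_response":
            with self._lock:
                slot = self._pending.pop(msg["id"])
            slot[1] = msg["result"]
            slot[0].set()                # wake the blocked caller
        else:
            # Device add/remove, asynchronous event, and completion queue
            # notifications are deferred, so a callback that issues a verb
            # cannot stall the thread that must receive its response.
            self.work_queue.put(msg)
```

If notifications were handled directly on the receive thread, a callback that called call_verb() would wait for a response that only the same, now blocked, thread could deliver.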


2.2.9.1.6 Vendor Proxy Driver 


A vendor proxy driver is an Intel® Xeon Phi™ coprocessor kernel module. Different vendor proxy drivers may be
installed to support specific RDMA devices. Upon initialization, each Vendor Proxy Driver registers with the IB* Proxy
Client for RDMA device add and remove notifications for the PCI vendor and device IDs that it supports. The Vendor
Proxy Driver uses the programming interface provided by the IB* Proxy Client to perform kernel-mode IB* verbs calls.
The Vendor Proxy Driver handles the transfer and interpretation of any private data shared between the vendor library
on the Intel® Xeon Phi™ coprocessor and vendor driver on the host.


A vendor proxy driver announces that a device is ready for use when it calls the IB* core ib_register_device() function.
All initialization must be complete before this call. The device must remain usable until the call to ib_unregister_device()
has returned, which removes the device from the IB* core layer. The Vendor Proxy Driver must call ib_register_device()
and ib_unregister_device() from process context. It must not hold any semaphores that could cause deadlock if a
consumer calls back into the driver across these calls.


Upper-level protocol consumers registered with the IB* core layer receive an add method callback indicating that a new
device is available. Upper-level protocols may begin using a device as soon as the add method is called for the device.
When a remove method callback is received, consumers must clean up and free all resources relating to a device before
returning from the remove method. A consumer is permitted to sleep in the add and remove methods. When a Vendor
Proxy Driver call to ib_unregister_device() has returned, all consumer allocated resources have been freed.


Each vendor proxy driver provides verb entry points through an ib_device structure pointer in the ib_register_device()
call. All of the methods in the ib_device structure exported by drivers must be fully reentrant. Drivers are required to
perform all synchronization necessary to maintain consistency, even if multiple function calls using the same object are
run simultaneously. The IB* core layer does not perform any serialization of verb function calls.


The vendor proxy drivers use the programming interface provided by the IB* Proxy Client to perform IB* verbs calls to 
the vendor driver on the host. Each vendor proxy driver is responsible for the transfer and interpretation of any private 
data shared between the vendor library on the Intel® Xeon Phi™ coprocessor and the vendor driver on the host.
Privileged verb operations use the default IB* core support functions to transfer data to or from user-mode as needed. 
The interpretation of this data is vendor specific. 


2.2.9.2 OFED*/SCIF 


The Symmetric Communications Interface (SCIF) provides the mechanism for internode communication within a single 
platform, where a node is an Intel® Xeon Phi™ coprocessor device or a host processor complex. SCIF abstracts the details
of communicating over PCI Express* (and controlling related coprocessor hardware) while providing an API that is 
symmetric between the host and the Intel® Xeon Phi™ coprocessor. 


MPI (Message-Passing Interface, http://www.mpi-forum.org) on the Intel® Xeon Phi™ coprocessor can use either the
TCP/IP or the OFED* stack to communicate with other MPI nodes. The OFED*/SCIF driver enables a hardware
InfiniBand* Host Channel Adapter (IBHCA) on the PCI Express* bus to access physical memory on an Intel® Xeon
Phi™ coprocessor device. When there is no IBHCA in the platform, the OFED*/SCIF driver emulates an IBHCA, enabling
MPI applications on the Intel® Xeon Phi™ coprocessor devices in the platform.


OFED*/SCIF implements a software-emulated InfiniBand* HCA to allow OFED*-based applications, such as the Intel®
MPI Library for Intel® MIC Architecture, to run on Intel? MIC Architecture without the presence of a physical HCA. 
OFED*/SCIF is only used for intranode communication whereas CCL is used for internode communication. 


OFED* provides an industry-standard, low-latency, high-bandwidth communication package for HPC applications,
leveraging the RDMA-based high performance communication capabilities of modern fabrics such as InfiniBand*. SCIF is
a communication API (sections 2.2.5.1 and 5.1) for the Intel® Many Integrated Core Architecture (Intel® MIC
Architecture) device that defines an efficient and consistent interface for point-to-point communication between Intel®
Xeon Phi™ coprocessor nodes, as well as between a coprocessor and the host. By layering OFED* on top of SCIF, many
OFED*-based HPC applications become readily available to Intel® MIC Architecture.


The OFED* software stack consists of multiple layers, from user-space applications and libraries to kernel drivers. Most 
of the layers are common code shared across hardware from different vendors. Vendor dependent code is confined in 
the vendor-specific hardware driver and the corresponding user-space library (to allow kernel bypass). Figure 2-24 
shows the architecture of the OFED*/SCIF stack. Since SCIF provides the same API for Intel® Xeon Phi™ coprocessor and
the host, the architecture applies to both cases. 


The rounded bold blocks in Figure 2-24 are the modules specific to OFED*/SCIF. These modules include the IB-SCIF
Library, IB-SCIF Driver, and SCIF (the kernel space driver only).


2.2.9.2.1 IB-SCIF Library


The IB-SCIF Library is a user-space library that is required by the IB Verbs Library to work with the IB-SCIF Driver. It
defines a set of routines that the IB Verbs Library calls to complete the corresponding functions defined by the
user-mode IB Verbs API. This allows vendor specific optimization (including kernel bypass) to be implemented in user
space. The IB-SCIF Library, however, does not provide kernel bypass; it relays user-mode requests to the kernel-mode
driver through the interface exposed by the IB uverbs driver.


2.2.9.2.2  IB-SCIF Driver 


The IB-SCIF Driver is a kernel module that implements a software-based RDMA device. At initialization, it sets up one
connection between each pair of SCIF nodes, and registers with the IB core driver as an "iWARP" device (to avoid MAD
related functions being used). For certain OFED* operations (plain RDMA read/write), data is transmitted directly using
the SCIF RMA functions. For all other OFED* operations, data is transmitted as packets, with headers that identify the
communication context so that a single connection between two SCIF nodes is sufficient to support an arbitrary number
of logical connections. Under the packet protocol, small-sized data is transmitted with the scif_send() and scif_recv()
functions, and large-sized data is transmitted with the SCIF RMA functions after a handshake. When both ends of the
logical connection are on the same SCIF node (i.e., loopback), data is copied directly from the source to the destination
without involving SCIF.
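
The transport choices described above amount to a small decision procedure, sketched here. The function, the opcode strings, and in particular the size threshold are assumptions for illustration; the guide does not state where the boundary between small and large data lies.

```python
from enum import Enum


class Path(Enum):
    LOCAL_COPY = "direct copy, SCIF not involved"
    SCIF_RMA = "SCIF RMA functions"
    PACKET_SEND_RECV = "packet via scif_send()/scif_recv()"
    PACKET_THEN_RMA = "handshake packets, then SCIF RMA functions"


# ASSUMPTION: hypothetical cutoff between "small" and "large" data.
SMALL_MESSAGE_LIMIT = 4096


def choose_path(opcode, length, src_node, dst_node):
    """Pick the transfer mechanism for one OFED* operation."""
    if src_node == dst_node:
        return Path.LOCAL_COPY           # loopback on the same SCIF node
    if opcode in ("rdma_read", "rdma_write"):
        return Path.SCIF_RMA             # plain RDMA read/write
    if length <= SMALL_MESSAGE_LIMIT:
        return Path.PACKET_SEND_RECV     # packet protocol, small data
    return Path.PACKET_THEN_RMA          # packet protocol, large data
```

For example, choose_path("send", 64, 0, 1) selects the scif_send()/scif_recv() path, while the same operation between two endpoints on node 1 never touches SCIF.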


2.2.9.2.3 SCIF (See also Section 5.1) 


The SCIF kernel module provides a communication API between Intel® Xeon Phi™ coprocessors and between an Intel®
Xeon Phi™ coprocessor and the host. SCIF itself is not part of OFED*/SCIF. OFED*/SCIF uses SCIF as the only internode
communication channel (in SCIF terminology, the host is a node, and each Intel® Xeon Phi™ coprocessor card is a
separate node). Although there is a SCIF library that provides a similar API in user space, that library is not used by
OFED*/SCIF.


[Figure 2-24 block labels: MPI Application; IB Verbs Library; IB-SCIF Library; Kernel Mode boundary; IB-SCIF driver; Host / Coprocessor]


Figure 2-24: OFED*/SCIF Modules


2.2.9.3 Intel® MPI Library for Intel® MIC Architecture


The Intel® MPI Library for Intel® MIC Architecture provides only the Hydra process manager (PM). Each node and each 
coprocessor are identified using their unique symbolic or IP addresses. Both external (e.g., command line) and internal 
(e.g., MPI_Comm_spawn) methods of process creation and addressing capabilities to place executables explicitly on the
nodes and the coprocessors are available. This enables you to match the target architecture and the respective 
executables. 


Within the respective units (host nodes and coprocessors), the MPI processes are placed and pinned according to the
default and eventual explicit settings as described in the Intel® MPI Library documentation. The application should be
able to identify the platform it is running on (host or coprocessor) at runtime.


The Intel® MPI Library for Intel® MIC Architecture supports the communication fabrics shown in Figure 2-25.


[Figure 2-25 block labels: MPI-2.2; ADI3*; CH3*]


Figure 2-25. Supported Communication Fabrics


2.2.9.3.1 Shared Memory 


This fabric can be used within any coprocessor, between the coprocessors attached to the same node, and between a
specific coprocessor and the host CPUs on the node that the coprocessor is attached to. The intracoprocessor
communication is performed using the normal mmap(2) system call (shared memory approach). All other
communication is performed in a similar way based on the scif_mmap(2) system call of the Symmetric Communication
Interface (SCIF). This fabric can be used exclusively or combined with any other fabric, typically for higher performance.


Figure 2-26. Extended SHM Fabric Structure


The overall structure of the extended SHM fabric is illustrated in Figure 2-26. The usual shared memory (SHM) 
communication complements the SCIF SHM extension that supports multisocket platforms, each socketed 
processor having a PCI Express* interface. SCIF-based SHM extensions can be used between any host processor 
and any Intel® Xeon Phi™ coprocessor, and between any two such coprocessors connected to separate PCI
Express* buses. 


2.2.9.3.2 DAPL/OFA* 


This fabric is accessible through two distinct interfaces inside the Intel® MPI Library: the Direct Access Programming
Library (DAPL*) and the OpenFabrics Alliance (OFA*) verbs [also known as OpenFabrics Enterprise
Distribution (OFED*) verbs] of the respective Host Channel Adapter (HCA). In both cases, the typical Remote Memory
Access (RMA) protocols are mapped onto the appropriate parts of the underlying system software layers; in this case,
the scif_writeto(2) and scif_readfrom(2) SCIF system calls.


2.2.9.3.3 TCP 


This fabric is normally the slowest of all fabrics available. This fabric is normally used as a fallback communication 
channel when the higher performance fabrics mentioned previously cannot be used for some reason. 


2.2.9.3.4 Mixed Fabrics 


All these fabrics can be used in reasonable combinations for the sake of better performance; for example, shm:dapl, 
shm:OFA*, and shm:tcp. All the default and eventual explicit settings described in the Intel® MPI Library documentation 
are inherited by the Intel® MPI Library for Intel® MIC Architecture. This also holds for the possibility of intranode use of 
both the shared memory and RDMA interfaces such as DAPL or OFA*. 
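
A combination such as shm:dapl names two fabrics. The sketch below splits such a string under the assumption, taken from the usual Intel® MPI Library convention rather than from this guide, that the first name is used within a node and the second between nodes, and that a single name applies to both. The function name and the set of accepted names are hypothetical and limited to the fabrics mentioned in this section.

```python
KNOWN_FABRICS = {"shm", "dapl", "ofa", "tcp"}


def parse_fabrics(spec):
    """Split a fabric specification into (intranode, internode) names."""
    # Normalize case and drop the trademark asterisk used in this guide.
    parts = [p.strip().lower().rstrip("*") for p in spec.split(":")]
    if not 1 <= len(parts) <= 2:
        raise ValueError("expected 'fabric' or 'intra:inter', got %r" % spec)
    for name in parts:
        if name not in KNOWN_FABRICS:
            raise ValueError("unknown fabric %r" % name)
    # A single name serves both roles.
    return parts[0], parts[-1]
```

With this reading, parse_fabrics("shm:tcp") yields shared memory inside each node and TCP as the fallback between nodes.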


2.2.9.3.5 Standard Input and Output 


Finally, the Intel® MPI Library for Intel® MIC Architecture supports the following types of input/output (I/O):

• Standard file I/O. The usual standard I/O streams (stdin, stdout, stderr) are supported through the Hydra PM as
usual. All typical features work as expected within the respective programming model. The same is true for the file
I/O.
• MPI I/O. All MPI I/O features specified by the MPI standard are available to all processes if the underlying file
system(s) support it.


Please consult the (Intel® MPI Library for Intel® MIC Architecture, 2011-2012) user guide for details on how to set up and
get MPI applications running on systems with Intel® Xeon Phi™ coprocessors.


2.2.10 Application Programming Interfaces


Several application programming interfaces (APIs) aid in porting applications to the Intel® Xeon Phi™ coprocessor
system. They are the sockets networking interface, the Message Passing Interface (MPI), and the Open Computing
Language (OpenCL*); all are industry standards that can be found in multiple execution environments. Additionally, the
SCIF APIs have been developed for the Intel® Xeon Phi™ coprocessor.


2.2.10.1 SCIF API 


SCIF serves as the backbone for intraplatform communication and exposes low-level APIs that developers can program 
to. A more detailed description of the SCIF API can be found in Section 5. 


2.2.10.2 NetDev Virtual Networking 


The virtual network driver provides a network stack connection across the PCI Express* bus. The NetDev device driver
emulates a hardware network driver and provides a TCP/IP network stack across the PCI Express* bus. The Sockets API
and library provide parallel applications with a means of end-to-end communication between computing agents (nodes)
that is based on a ubiquitous industry standard. This API, implemented upon the TCP/IP protocol stack, simplifies
application portability and scalability. Other standard networking services, such as NFS, can be supported through this
networking stack. See Section 5 for more details.


3 Power Management, Virtualization, RAS


The server management and control panel component of the Intel® Xeon Phi™ coprocessor software architecture
provides the system administrator with the runtime status of the Intel® Xeon Phi™ coprocessor card(s) installed into a
given system. There are two use cases that are of interest. The first is the rack-mounted server that is managed
remotely and that relies on third-party management software. The second is a stand-alone pedestal or workstation
system that uses a local control panel application to access information stored on the system. Applications of this type
are designed to execute in a specific OS environment, and solutions for both the Linux* and the Windows operating
systems are available. Although these implementations may utilize common modules, each must address the particular
requirements of the target host OS.


There are two access methods by which the System Management (SM)/control panel component may obtain status
information from the Intel® Xeon Phi™ coprocessor devices. The "in-band" method uses the SCIF network and the
capabilities designed into the coprocessor OS and the host driver. It delivers Intel® Xeon Phi™ coprocessor card status to
the user and provides a limited ability to control hardware behavior. The same information can be obtained using the
"out-of-band" method. This method starts with the same capabilities in the coprocessors, but sends the information to
the Intel® Xeon Phi™ coprocessor card's System Management Controller (SMC). The SMC can then respond to queries
from the platform's BMC using the IPMB protocol to pass the information upstream to the user.


3.1 Power Management (PM) 


Today's power management implementations increasingly rely on multiple software pieces working cooperatively with
hardware to reduce the power consumption of the platform while minimizing the impact on performance. Intel®
MIC Architecture based platforms are no exception; power management for Intel® Xeon Phi™ coprocessors involves
multiple software levels.


Power management for the Intel® Xeon Phi™ coprocessor is predominantly performed in the background. The power
management infrastructure collects the necessary data to select performance states and target idle states, while the
rest of the Intel® Manycore Platform Software Stack (MPSS) goes about the business of processing tasks for the host OS.
In periods of idleness, the PM software places Intel® Xeon Phi™ coprocessor hardware into one of the low-power idle
states to reduce the average power consumption.


Intel® Xeon Phi™ coprocessor power management software is organized into two major blocks. One is integrated into
the coprocessor OS running locally on the Intel® Xeon Phi™ coprocessor hardware. The other is part of the host driver
running on the host. Each contributes uniquely to the overall PM solution.


Figure 3-1. Intel® Xeon Phi™ Coprocessor Power Management Software Architecture


3.1.1 Coprocessor OS Role in Power Management


Because this code controls critical power and thermal management safeguards, modification of this code may void the 
warranty for Intel® Xeon Phi™ coprocessor devices used with the modified code. 


Power management capabilities within the coprocessor OS are performed in the kernel at ring 0. The one exception is
that during PC6 exit, the bootloader plays an important role after loss of core power.


Primarily the coprocessor OS kernel code is responsible for managing:


• Selection and setting of the hardware's performance level (P-states) including any "Turbo Mode" capability that
may be present.

• Data collection used to assess the level of device utilization; device thermal and power consumption readings must
be collected to support the P-state selection process.

• Modified P-state selection, which is based on externally imposed limits on card power consumption.

• Selection and setting of core idle states (C-states).

• Data collection to assess the level of device utilization that will be used to support core C-state selection.

• Save and restore CPU context on core C6 entry and exit.

• Orchestrate the entry and exit of the package to Auto C3 package state in order to ensure that the coprocessor OS
is able to meet the scheduled timer deadlines.

• Prepare the Intel® Xeon Phi™ coprocessor for entry into the PC6-state (that is, to make sure all work items are
completed before entering PC6-state), save and restore machine context before PC6 entry and after PC6 exit, and
return to full operation after PC6-state exit. The Bootloader then performs reset initialization and passes control to
the GDDR resident coprocessor OS kernel.


PM services executing at ring 0 provide the means to carry out many of the PM operations required by the coprocessor
OS. These services are invoked at key event boundaries in the OS kernel to manage power on the card. The
active-to-idle transition of the CPU in the kernel is one such event boundary that provides an opportunity for PM
services in the kernel to capture data critical for calculating processor utilization. In addition, idle routines use
restricted instructions (e.g., HLT or MWAIT) enabling processors to take advantage of hardware C-states. Other services
perform or assist in evaluating hardware utilization, selection, and execution of target P- and C-states. Finally, there
are services that directly support entry and exit from a particular C-state.


PC-state entry/exit refers to the dedicated execution paths or specific functions used during transition from a C0 state of
operation to a specific PC-state (entry) or from a specific PC-state back to the C0 state (exit). To minimize the time
required in transition, these dedicated execution paths must be tailored to the specific hardware need of the target
PC-state. Minimizing transition times enables PC-states to be used more frequently, thus lowering average power
consumption without any user-perceived impact on performance.


3.1.2 Bootloader Role in Power Management 


The Bootloader is put into service during PC6 exit. This PC6-state lowers the VccP voltage to zero. As a result, the Intel®
Xeon Phi™ coprocessor cores begin code execution at the reset vector (i.e., FFFF FFF0h) when the voltage and clocks are
restored to operational levels. However, unlike cycling power at the platform level or at a cold reset, an abbreviated
execution path designed specifically for PC6 state exit can be executed. This helps minimize the time required in
returning the Intel® Xeon Phi™ coprocessor to full operation and prevents a full-scale boot process from destroying GDDR
contents that are retained under self-refresh. These shortened execution paths are enabled in part by hardware state
retention on sections that remain powered and through the use of a self-refresh mechanism for GDDR memory devices.


3.1.3 Host Driver Role in Power Management 
The Host driver plays a central role in power management. Its primary power management responsibilities are: 


• To monitor and manage the Intel® Xeon Phi™ coprocessor package idle states.

• To address server management queries.

• To drive the power management command/status interface between the host and the coprocessor OS.

• To interface with the host communication layer.


3.1.4 Power Reduction


The PM software reduces card power consumption by leveraging the Intel® Xeon Phi™ coprocessor hardware features
for voltage/core frequency scaling (P-states), core idle states, and package idle states. By careful selection of the 
available P-states and idle states, the PM software opportunistically reduces power consumption without impacting the 
application performance. For all the idle and P-states, software follows a two-step approach: state selection followed by 
state control or setting. The architecture reflects this by grouping the modules as either state-selection or state-setting 
modules. 


3.1.4.1  P-State Selection 


The PM software uses Demand Based Scaling (DBS) to select the P-state under which the cores operate. "Demand"
refers to the utilization of the CPUs over a periodic interval. An increase in CPU utilization is seen as a signal to raise the
core frequency (that is, to move to a lower-numbered P-state) in order to meet the increase in demand. Conversely, a
drop in utilization is seen as an opportunity to reduce the core frequency and hence save power. The primary
requirement of the P-state selection algorithm is to be responsive to changes in workload conditions so that P-states
track the workload fluctuations and hence reduce power consumption with little or no performance impact. Given this
sensitivity of the P-state selection algorithm to workload characteristics, the algorithm undergoes extensive tuning to
arrive at an optimum set of parameters. The software architecture allows for extensive parameterization of the
algorithms and even the ability to switch algorithms on the fly. Some of the parameters that can be changed to affect
the P-state selection are:


• Evaluation period over which utilization values are calculated.

• Utilization step size over which a P-state selection is effective.

• P-state step size that controls the P-state gradient between subsequent selections.

• Guard bands around utilization thresholds to create a hysteresis in the way the P-states are increased and
decreased. This prevents detrimental ping-pong behavior of the P-states.


The architecture supports user-supplied power policy choices that can map to a set of predefined parameters from the 
list above. Other variables such as power budget for the Intel® Xeon Phi™ coprocessor hardware, current reading, and 
thermal thresholds can factor into the P-state selection either as individual upper limits that cause the P-states to be 
throttled automatically, or can be combined in more complex ways to feed into the selection algorithm. 


The coprocessor OS has exclusive responsibility for P-state selection. The P-state selection module contains the 
following routines: 


• Initialization 
• Evaluation task 
• Notification handler 


The P-state selection module has interfaces to the core coprocessor OS kernel, the P-state setting module, and the PM 
Event Handler. The architecture keeps this module independent of the underlying hardware mechanisms (for example, 
those for setting P-states or for detecting over-current or thermal conditions). 


The P-state selection module registers a periodic timer task with the coprocessor OS core kernel. The "evaluation 
period" parameter decides the interval between consecutive invocations of the evaluation task. Modern operating 
system kernels maintain per-CPU running counters that keep track of the cumulative time that the CPU is idle, that the 
CPU executes interrupt code, that the CPU executes kernel code, and so on. The evaluation task wakes up every 
evaluation time period, reads from these per-CPU counters the total time that the CPU was idle during the last 
evaluation window, and calculates the utilization for that CPU. For the purpose of calculating the target P-state, the 
maximum utilization value across all CPUs is taken. Since the evaluation task runs in the background while the CPUs are 
executing application code, it is important that software employs suitable methods to read an internally consistent value 
for the per-CPU idle time counters without any interference to code execution on the CPUs. 
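

The utilization calculation described above can be sketched as follows. This is a minimal illustration in Python, not kernel code; the CpuCounters record, the microsecond units, and the function name are assumptions made for the example and do not come from the PM software. 

```python
from dataclasses import dataclass


@dataclass
class CpuCounters:
    """Snapshot of one CPU's cumulative idle time (hypothetical layout)."""
    idle_us: int


def max_utilization(prev, curr, period_us):
    """Return the highest utilization (0 to 100) across all CPUs.

    prev and curr are per-CPU snapshots taken one evaluation period apart.
    """
    highest = 0.0
    for before, after in zip(prev, curr):
        idle = after.idle_us - before.idle_us
        # Clamp so that small sampling skew cannot push the result
        # outside the 0..100 range.
        idle = min(max(idle, 0), period_us)
        utilization = 100.0 * (period_us - idle) / period_us
        highest = max(highest, utilization)
    return highest
```

Taking the maximum rather than the average means that a single busy CPU is enough to keep the core frequency up, which matches the selection rule stated in the text. 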
Once the maximum utilization value (as a percentage of the evaluation period) across all CPUs is computed, the evaluation 
task has to map this value to a target P-state. There are a number of ways this can be accomplished; Figure 3-2 shows 
one of them. Thresholds ThD1 and ThD2 provide the hysteresis guard band within which the P-state 
remains the same. The goal of this algorithm is to raise the P-state (that is, lower the core frequency) progressively until the 
maximum utilization value rises to a configurable threshold value (ThD2). As workload demand increases 
and the maximum utilization rises beyond this threshold, the algorithm lowers the target P-state (that is, raises the core 
frequency) while staying within power and thermal limits. The threshold values and the P-state increase and decrease 
step sizes are all parameters that either map to a policy or are set explicitly. 
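

One way to express the threshold mapping is shown below. This is a sketch under stated assumptions, not the algorithm in Figure 3-2: the function name, the default step size, the deepest parameter (the numeric index of Pn), and the limit parameter (the lowest P-state number currently allowed by power and thermal limits) are all hypothetical. 

```python
def next_pstate(current, utilization, thd1, thd2, step=1, deepest=5, limit=0):
    """Return the next target P-state number (0 is the highest frequency).

    utilization, thd1 and thd2 are percentages, with thd1 < thd2.
    """
    if utilization > thd2:
        # Demand is high: lower the P-state number to raise the frequency.
        target = current - step
    elif utilization < thd1:
        # Demand is low: raise the P-state number to save power.
        target = current + step
    else:
        # Inside the guard band: keep the P-state to avoid ping-ponging.
        target = current
    # Stay within the P-state table and the power/thermal limit.
    return max(limit, min(deepest, target))
```

For example, with thd1 = 40 and thd2 = 80, a utilization of 60 leaves the P-state unchanged, while 95 moves it one step toward P0. 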


The P-state selection module has to handle notifications from the rest of the system and modify its algorithm 
accordingly. The notifications are: 


• Start and stop thermal throttling due to conditions such as CPUHOT. 
• Changes to the card power budget. 
• Thermal threshold crossings. 
• Changes to power policy and P-state selection parameters. 


The notifications bubble up from the PM Event Handler and can be totally asynchronous to the evaluation task. The 
effect of these notifications can range from modifications to P-state selection to a complete pause or reset of the 
evaluation task. 


The host driver generally does not play an active role in the P-state selection process. However, the host driver 
interfaces with the coprocessor OS P-state selection module to get P-state information, to set or get policy, and to set or 
get parameters related to P-state selection. 


3.1.4.2 P-State Control 


The P-state control module implements the P-states on the target Intel® Xeon Phi™ coprocessor hardware. The process 
of setting P-states in the hardware can vary between Intel® Xeon Phi™ coprocessor products. Hence the P-state module, 
by hiding the details of this process from other elements of the PM software stack, makes it possible to reuse large parts 
of the software between different generations of Intel® Xeon Phi™ coprocessors. 


P-state control operations take place entirely within the coprocessor OS. The P-state control module has the following 
main routines: 


• P-state table generation routine 
• P-state set/get routine 
• SVID programming routine 
• Notifier routine 


The P-state control module exports: 


• Get/set P-state 
• Register notification 
• Read current value 
• Set core frequency/voltage fuse values 


On Intel® Xeon Phi™ coprocessor devices (which do not have an architecturally defined mechanism to set P-states, like 
an MSR write), the mapping of P-states to core frequency and voltage has to be generated explicitly by software and 
stored in a table. The table generation routine takes as parameters: 


• Core frequency and voltage pairs for a minimal set of guaranteed P-states (Pn, P1, and P0) from which other pairs 
can be generated using linear interpolation. 
• Core frequency step sizes for different ranges of the core frequency. 
• Mapping between core frequency value and corresponding MCLK code. 
• Mapping between voltage values and SVID codes. 


There are hardware-specific mechanisms by which these P-states are made available to the coprocessor OS. In the 
Intel® Xeon Phi™ coprocessor, these values are part of a preset configuration space that is read by the bootloader, 
copied to flash MMIO registers, and then read by the P-state control module. This routine exports the "Set core 
frequency/voltage fuse configuration" interface so that the coprocessor OS flash driver that initializes the MMIO registers 
containing the fuse configuration can store them before they get initialized. 


The P-state Get/Set routine uses the generated P-state table to convert P-states to core frequency and voltage pairs and 
vice versa. 
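

The table generation and lookup can be sketched as follows. This is an illustration only: the function names, the single fixed frequency step, and the units are assumptions, and the MCLK and SVID code mappings that the real routine also takes are left out. Frequency and voltage values passed in are whatever the guaranteed pairs happen to be on a given part. 

```python
def build_pstate_table(pn, p1, p0, step_mhz):
    """Return [(freq_mhz, voltage_mv), ...] indexed by P-state number.

    pn, p1 and p0 are the guaranteed (freq_mhz, voltage_mv) pairs. Entries
    between P1 and Pn are spaced step_mhz apart, and their voltages come
    from linear interpolation between the P1 and Pn pairs.
    """
    (f_lo, v_lo), (f_hi, v_hi) = pn, p1
    if step_mhz <= 0 or not f_lo < f_hi:
        raise ValueError("need step_mhz > 0 and Pn frequency below P1")
    table = [p0, p1]
    freq = f_hi - step_mhz
    while freq > f_lo:
        volt = v_lo + (v_hi - v_lo) * (freq - f_lo) / (f_hi - f_lo)
        table.append((freq, round(volt)))
        freq -= step_mhz
    table.append(pn)
    return table


def pstate_to_pair(table, pstate):
    """Convert a P-state number to its (frequency, voltage) pair."""
    return table[pstate]


def frequency_to_pstate(table, freq_mhz):
    """Convert a core frequency back to its P-state number."""
    for pstate, (freq, _volt) in enumerate(table):
        if freq == freq_mhz:
            return pstate
    raise KeyError(freq_mhz)
```

Keeping the table as the single source of truth lets the Get/Set routine convert in both directions without knowing how the pairs were derived. 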


Other parts of the coprocessor OS may need to be notified of changes to core frequency. For example, parts of the 
coprocessor OS that use the Timestamp Counter (TSC) as a clock source to calculate time intervals must be notified of 
core frequency changes so that the TSC can be recalibrated. The notifier routine exports a "register notification" 
interface so that other routines in the coprocessor OS can call-in to register for notification. The routine sends a 
notification any time a core frequency change occurs as a result of a P-state setting. 
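

The "register notification" interface amounts to a callback registry. The sketch below shows the idea under assumed names; the class, its methods, and the callback signature are hypothetical and are not the coprocessor OS interface. 

```python
class FrequencyChangeNotifier:
    """Keeps a list of callbacks and calls them on core frequency changes."""

    def __init__(self, initial_mhz):
        self._current_mhz = initial_mhz
        self._callbacks = []

    def register(self, callback):
        """Register callback(old_mhz, new_mhz) for future changes."""
        self._callbacks.append(callback)

    def apply_frequency(self, new_mhz):
        """Record a new core frequency and notify only if it changed."""
        old_mhz, self._current_mhz = self._current_mhz, new_mhz
        if new_mhz != old_mhz:
            for callback in self._callbacks:
                callback(old_mhz, new_mhz)
```

A timekeeping component that uses the TSC as its clock source would register a callback here and recalibrate its tick-to-time conversion whenever the callback fires. 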


3.1.4.3 Idle State Selection 


Prudent use of the core and package idle states enables the Intel® Xeon Phi™ coprocessor PM software to further 
reduce card power consumption without incurring a performance penalty. The algorithm for idle state selection 
considers two main factors: the expected idle residency and the idle state latency. In general, the deeper the idle state 
(and hence the greater the power saving), the higher the latency. The formula for deciding the particular idle state to 
enter is of the form: 


Expected idle residency >= C * (ENTRY_LATENCY_Cx + EXIT_LATENCY_Cx) 


Where: 
- C is a constant that is always greater than one and is determined by power policy. It can also be set 
explicitly. 
- ENTRY_LATENCY_Cx is the time required to enter the Cx idle state. 
- EXIT_LATENCY_Cx is the time required to exit the Cx idle state. 


The comparison is performed for each of the supported idle states (Cx), and the deepest idle state that satisfies this 
comparison is selected as the target idle state. If none of the comparisons succeeds, then the target idle state is set 
to C0 (no idle state). 
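

The comparison can be written out directly. The sketch below is illustrative: the IdleState record, the microsecond units, and the default value of the constant are assumptions, and any latency values passed in are placeholders, not hardware figures. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    entry_latency_us: int
    exit_latency_us: int


def select_idle_state(states, expected_residency_us, c=2.0):
    """Return the name of the deepest state that passes the test, or "C0".

    states must be ordered from shallowest to deepest; c must exceed one.
    """
    chosen = "C0"
    for state in states:
        round_trip = state.entry_latency_us + state.exit_latency_us
        if expected_residency_us >= c * round_trip:
            chosen = state.name  # a deeper passing state replaces it
    return chosen
```

A larger constant makes the selection more conservative, because a state then qualifies only when the expected residency dwarfs its entry and exit cost. 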


The expected idle residency for a CPU is a function of several factors. Some are deterministic, such as 
synchronous events like timers scheduled to fire on the CPU at known times in the future (which force the CPU 
out of idle), and some are nondeterministic, such as interprocessor interrupts. 


In order to keep the idle-state selection module independent of the specific Intel® Xeon Phi™ coprocessor, the PM 
software architecture includes data structures that are used to exchange information between the idle-state selection 
and hardware-specific idle state control modules, such as the: 


• Number of core idle states supported by the hardware 
• Number of package idle states supported 
• For each core and package idle state: 
  - Name of the state (for user-mode interfaces) 
  - Entry and exit latency 
  - Entry point of the routine to call to set state 
  - Average historical residency of state 
  - TSC and LAPIC behavior in this idle state 
  - Bitmasks marking core CPUs that have selected this idle state 


The idle-state control module fills in most of the information in these data structures. 


3.1.4.3.1 Core Idle State Selection 


The Intel® Xeon Phi™ coprocessor supports a Core C1 idle state and a deeper Core C6 idle state. Each core idle state is 
the logical AND of the individual idle states of the CPUs that make up the core. While entry to and exit from the 
Core C1 state need no software intervention (beyond the individual CPUs executing a HALT), Core C6 entry and exit 
require the CPU state to be saved and restored by software. Hence software running on a CPU has to make a deliberate 
choice whether to allow the core (of which the CPU is a part) to transition to the Core C6 state. 
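

The logical-AND relationship between CPU and core idle states can be illustrated as follows. The state labels are assumptions made for the example: "C0" stands for a running CPU, "C1" for a halted CPU that has not enabled C6, and "C6" for a halted CPU that has enabled C6. 

```python
def core_idle_state(cpu_states):
    """Derive the core idle state from the states of the CPUs in the core."""
    if all(state == "C6" for state in cpu_states):
        return "CC6"  # every CPU is halted and has enabled C6
    if all(state in ("C1", "C6") for state in cpu_states):
        return "CC1"  # every CPU is halted, but not all enabled C6
    return "CC0"      # at least one CPU is still running
```

A single CPU that is busy, or that is halted without enabling C6, is therefore enough to keep the whole core out of Core C6. 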


3.1.4.3.2 The coprocessor OS Role in Core Idle State Selection 


Core idle state selection happens entirely in the coprocessor OS. As mentioned before, modern operating systems have 
an architecturally defined CPU idle routine. Entry to and exit from idleness occurs within this routine. The core 
idle-selection module interfaces with this routine to select the core idle state on entry and to collect idleness statistics on 
exit (to be used for subsequent idle state selections). The core idle state selection module has the following main 
routines: 


• Core idle select 
• Core idle update 
• Core idle get and set parameter 


Figure 3-2 shows the Core C6 selection process in the Intel® Xeon Phi™ coprocessor. 


3.1.4.3.2.1 Core Idle Select 


This routine interfaces to the coprocessor OS CPU idle routine and gets control before the CPU executes the idle 
instruction (HALT in the case of the Intel® Xeon Phi™ coprocessor). The core idle select routine runs the algorithm to 
compute the expected idle residency of the CPU. The main components in the idle residency calculation are the next 
timer event time for the CPU and the historic idle residency values for the CPU. 


In the case of Core C6 for the Intel® Xeon Phi™ coprocessor, the algorithm running on the last CPU in the core to go idle 
can optionally estimate the idle residency of the core by taking into account the expected idle residency of other idle 
CPUs in the core and the time elapsed since the other CPUs went idle. 


3.1.4.3.2.2 Core Idle Update 


This routine interfaces to the coprocessor OS CPU idle routine and gets control after the CPU wakes up from idle. It 
records the actual residency of the CPU in the idle state for use in the computation of the historic idle residency 
component in the core idle selection. 
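

The interplay between the select and update routines can be sketched as a small predictor. The text names the two inputs (the next timer event and the historic residency) but not how they are combined, so the choices below are assumptions: the estimate is the smaller of the two inputs, and the history is an exponentially weighted moving average. The class and method names are hypothetical. 

```python
class ResidencyPredictor:
    """Per-CPU estimate of how long the next idle period will last."""

    def __init__(self, weight=0.25):
        self.weight = weight      # share given to the newest sample
        self.history_us = None    # no history until the first wake-up

    def expected(self, now_us, next_timer_us):
        """Called on idle entry (the role of Core Idle Select)."""
        until_timer = max(next_timer_us - now_us, 0)
        if self.history_us is None:
            return until_timer
        return min(until_timer, int(self.history_us))

    def update(self, actual_us):
        """Called on wake-up (the role of Core Idle Update)."""
        if self.history_us is None:
            self.history_us = float(actual_us)
        else:
            self.history_us += self.weight * (actual_us - self.history_us)
```

The timer bound is deterministic, while the history term stands in for nondeterministic wake-ups such as interprocessor interrupts. 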


3.1.4.3.2.3 Core Idle Get/Set Parameter 


This routine provides interfaces to user-mode programs that allow them to get and set core idle state parameters such 
as the latency constant C used in the equation to determine target core idle state. 


3.1.4.3.3 Package Idle State Selection 


The Intel® Xeon Phi™ coprocessor supports package idle states such as Auto-C3 (wherein all cores and other agents on 
the ring are clock gated), Deeper-C3 (which further reduces the voltage to the package), and Package C6 (which 
completely shuts off power to the package while keeping card memory in self-refresh). Some of the key differences 
between the package idle states and the core (CPU) idle states are: 


• One of the preconditions for all package idle states is that all the cores be idle. 

• Unlike P-states and core idle states, package state entry and exit are controlled by the Intel® Xeon Phi™ 
coprocessor host driver (except for Auto-C3, where it is possible to enter and exit the 
idle state without host driver intervention). 

• Wake up from package idle states requires an external event such as PCI Express* traffic, external interrupts, or 
active intervention by the Intel® Xeon Phi™ coprocessor driver. 

• Idle residency calculations for the package states take into account the idle residency values of all the cores. 

• Since the package idle states cause the Timestamp Counter (TSC) and the local APIC timer to freeze, an external 
reference timer like the SBox Elapsed Time Counter (ETC) on the Intel® Xeon Phi™ coprocessor can be used, on 
wake up from idle, to synchronize any software timers that are based on the TSC or local APIC. 


3.1.4.3.4 The coprocessor OS Role in Package Idle State Selection 


The coprocessor OS plays a central role in selecting package idle states. The package idle state selection is facilitated in 
the coprocessor OS by three main routines: 

• Package idle select 
• Package idle update 
• Get/set package idle parameter 


3.1.4.3.4.1 Package Idle Select 


The last CPU that is ready to go idle invokes the package idle-select routine. As with the core idle state selection 
algorithm, the package idle-select algorithm bases its selection on the expected idle residency of the package and the 
latency of the package idle state. The expected idle residency is calculated using the earliest scheduled timer event 
across all cores and the historical data on package idleness. 


On the Intel® Xeon Phi™ coprocessor, the coprocessor OS selects the PC3 and PC6 package states. Figure 3-4 shows the 
software flow for package idle-state selection. 


While selecting a package idle state, the coprocessor OS PM software can choose to disregard certain scheduled timer 
events that are set up to accomplish housekeeping tasks in the OS. This ensures that such events do not completely 
disallow deeper package idle states from consideration. It is also possible for the coprocessor OS package idle-state 
selection algorithm to choose a deeper idle state (such as PC6), and still require that the package exit the deep idle state 
in order to service a timer event. In such cases, the coprocessor OS informs the host PM software of not only the target 
package idle state selected but also the desired wake-up time from the idle state. 
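

The timer handling described above can be sketched as follows. This covers only the timer component of the estimate (the historical component is omitted), and the function name and the (expiry, is_housekeeping) tuple layout are assumptions made for the example. 

```python
def plan_package_idle(timers, now_us):
    """Return (expected_residency_us, wake_time_us) for the package.

    timers is an iterable of (expiry_us, is_housekeeping) pairs gathered
    across all cores. Housekeeping timers are disregarded so that they do
    not rule out deeper package idle states. Both results are None when no
    other timer is pending.
    """
    firm = [expiry for expiry, is_housekeeping in timers
            if not is_housekeeping and expiry > now_us]
    if not firm:
        return None, None
    wake_time_us = min(firm)
    return wake_time_us - now_us, wake_time_us
```

The second value is the wake-up time that the coprocessor OS would pass to the host PM software along with the selected package idle state. 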


3.1.4.3.4.2 Package Idle Update 


This routine is invoked upon wake up from a package idle state. It records the actual time that the package was idle, 
which is then used in the idle residency calculation. Since the TSC and the local APIC timers freeze during a package idle 
state, this routine uses an external clock (such as the SBox ETC) on Intel® Xeon Phi™ coprocessor cards to measure the 
package idle time. 


3.1.4.3.5 Host Driver Role in Package Idle State Selection 


The PM task in the host driver plays a key role in the package idle-state selection process. Though the coprocessor OS 
selects the package idle state based on its assessment of the expected idle residency, there are other reasons that might 
cause the host PM task to modify this selection. Some of these are: 


• The coprocessor OS selects PC3 based on the expected residency of the cores. However, PC3 depends on the 
idleness of both the core and the uncore parts of the package. So, it is possible for a PC3 selection by the 
coprocessor OS to be overridden by the host driver if it determines that some part of the uncore chain is busy. 

• If the idle residency estimate by the coprocessor OS for a certain package idle state turns out to be too 
conservative and the package stays in the selected idle state longer than the estimated time, the host driver can 
decide to select a deeper idle state than the one chosen by the coprocessor OS. 

• Package idle states, such as DeepC3 and PC6 on the Intel® Xeon Phi™ coprocessor, require the active intervention 
of the host driver to wake up the package so that it can respond to PCI Express* traffic from the host. Therefore, 
these deeper idle states might be unsuitable in scenarios where the card memory is being accessed directly by a 
host application that bypasses the host driver. The host driver should detect such situations and override the 
deeper idle-state selections. 


3.1.4.3.6 Coprocessor OS-to-Host Driver Interface for Package Idle Selection 


The coprocessor OS and the host driver use two main interfaces to communicate their package idle state selections: 


• The coprocessor OS-host communication interface through SCIF messages. 

• The PM state flags, such as uOSPMState and hostPMState. In the Intel® Xeon Phi™ coprocessor, these flags are 
implemented as registers in the MMIO space. The uOSPMState flag is written by the coprocessor OS to indicate its state 
selection and is read by the host driver; the reverse holds for the hostPMState flag. 


The SCIF API and the package idle control API are implemented so as to be hardware independent. 


3.1.4.4 Idle State Control 


The idle state control function sets the cores (or the package) to the selected idle state. While controlling the core's idle 
state is primarily handled by the coprocessor OS, controlling the package idle state requires co-ordination between the 
host driver and the bootstrap software. 


3.1.4.4.1 Coprocessor OS Role in Idle State Control 


The idle-state control module in the coprocessor OS implements the selected core or package idle state on the target 
Intel® Xeon Phi™ coprocessor. It hides all the hardware details from the selection module. It initializes the data 
structures that it shares with the idle-state selection module with information on idle states specific to the Intel® Xeon 
Phi™ coprocessor. The interface to the selection module is mainly through these data structures. Table 3-1 lists some 
low-level routines in this module that are common to all idle states. 


Table 3-1. Routines Common to All Package Idle States 


Routine                 Description 

Save CPU State          Saves the register state of the selected logical processor. The CPU state includes 
                        basic program execution registers, x87 FPU registers, control registers, memory 
                        management registers, debug registers, memory type range registers (MTRR), and 
                        machine specific registers (MSR). The VPU register context is also saved. 

Restore CPU State       Restores the register state that was saved by the Save CPU State routine. 

Save Uncore State       Saves the Intel® Xeon Phi™ coprocessor hardware states that are not associated 
                        with CPUs (e.g., SBox). This function is used to preserve the uncore context in 
                        preparation for or during the PC6 entry sequence. 

Restore Uncore State    Restores the Intel® Xeon Phi™ coprocessor hardware state that was saved by the 
                        Save Uncore State routine. 


3.1.4.4.2 Core Idle State Control in the Coprocessor OS 


There are two routines that control the idle state of the core (Core C6): CC6 Enter and CC6 Exit. 


3.1.4.4.2.1 CC6 Enter 


The CC6 Enter routine starts when Core C6 is selected and prepares the CPU for CC6 entry. However, if one or more 
other CPUs in the core either are non-idle or did not enable C6, then the core might not enter the C6 idle state. The 
return from this routine to the caller (that is, to the CPU idle routine) looks exactly the same as a return from Core C1 
(a return from HALT). The only way software on an Intel® Xeon Phi™ coprocessor can tell that a CPU entered Core C6 is 
that the CPU exits Core C6 and executes its designated CC6 exit routine. The essential sequence of actions in this 
routine is as follows: 


1. Start CC6 Enter. 
2. Reset the count of CPUs that have exited CC6 (only for the last CPU in the core going idle). 
3. Save CR3. 
4. Switch page tables to the identity map of the lower 1 MB memory region. 
5. Run Save CPU State. 
6. Enable CC6 for the selected CPU. 
7. Enable interrupts and HALT. 




The real mode trampoline code runs in lower memory (first MB of memory), and the CC6 Enter entry point is an address 
in this memory range. The idle-state control driver copies the trampoline code to this memory area during its 
initialization. It is also important to make sure that this memory range is not used by the bootloader program. 


3.1.4.4.2.2 CC6 Exit 


When cores exit from CC6 (as a result of an interrupt to one or more CPUs in the core), they come back from reset in 
real mode and start executing code from an entry point that is programmed by the Enable CC6 routine. The essential 
sequence of actions in the CC6 exit routine is as follows: 


1. Start CC6 Exit. 
2. Run trampoline code to set up for 64-bit operation. 
3. Detect the CPU number from the APIC identification number. 
4. Restore the CPU state. 
5. Restore CR3. 
6. Increment the count of CPUs in the core that have exited CC6. 
7. Enable interrupts and HALT. 


As shown in Figure 3-5, it is possible for a CPU to exit CC6 while remaining HALTED and to go back to CC6 when the CC6 
conditions are met again. If a CPU stays HALTED between entry and exit from CC6, it is not required that the CPU state 
be saved every time it transitions to CC6. 


Figure 3-5 CPU Idle State Transitions 


3.1.4.4.3 Package Idle State Control 


Table 3-2 Package Idle State Behavior in the Intel® Xeon Phi™ Coprocessor 


Package       Core         Uncore       TSC/LAPIC    C3WakeupTimer         PCI Express* Traffic 
Idle State    State        State 

PC3           Preserved    Preserved    Frozen       On expiration,        Package exits PC3 
                                                     package exits PC3 
Deep C3       Preserved    Preserved    Frozen       No effect             Times out 
PC6           Lost         Lost         Reset        No effect             Times out 

As shown in Table 3-2, the package idle states behave differently in ways that impact the PM software running both on 
the card as well as on the host. The idle-state control driver handles the following key architectural issues: 


• LAPIC behavior: The LAPIC timer stops counting forward when the package is in any idle state. Modern operating 
systems support software timers (like the POSIX timer) that enable application and system programs to schedule 
execution in terms of microseconds or ticks from the current time. On the Intel® Xeon Phi™ coprocessor, due to the 
absence of platform hardware timers, the LAPIC timer is used to schedule timer interrupts that wake up the CPU to 
service the software timer requests. When the LAPIC timer stops making forward progress during package idle 
states, timer interrupts from the LAPIC are suspended. So, the software timers cannot be serviced when the 
package is in an idle state. In order for the operating system to honor such software timer requests, the package 
idle state control software enlists the services of hardware timers, such as the C3WakeupTimer in the Intel® Xeon 
Phi™ coprocessor, or the host driver to wake up the card in time to service the scheduled timers. 

• TSC behavior: On the Intel® Xeon Phi™ coprocessor, the TSC is used as the main clock source to maintain a running 
clock of ticks in the system. When the TSC freezes during package idle states, the software must be able to rely on 
an external reference clock to resynchronize the TSC-based clock upon exit from the package idle state. On the 
Intel® Xeon Phi™ coprocessor, the SBox Elapsed Time Counter can be used for this purpose. 

• Effect of PCI Express* traffic: While PCI Express* traffic brings the card out of a Package C3 idle state, it does not 
do so for deeper idle states such as DeepC3 or PC6. Also, the transition to DeepC3 or PC6 from PC3 does not 
happen automatically but requires active intervention from host software. Consequently, when the host driver 
places the card in one of these deep package idle states, it has to ensure that all subsequent PCI Express* traffic to 
the card is directed through the host driver. This makes it possible for the host driver to bring the card out of one 
of these deeper package idle states so that the card can respond to the subsequent PCI Express* traffic. 

• Core and uncore states: While the core and uncore states are preserved across PC3 and DeeperC3 idle state entry 
and exit, they are not preserved for PC6. So, when the host driver transitions the package to PC6 from PC3 or 
DeepC3, it has to wake up the card and give the coprocessor OS a chance to save the CPU state as well as to flush 
the L2 cache before it puts the package in the PC6 idle state. 


Package idle state control is implemented both in the coprocessor OS and in the host driver. 


3.1.4.4.3.1 Package Idle State Control in the Coprocessor OS 


The coprocessor OS role in package idle-state control is limited to the PC3 and PC6 idle states. DeepPC3 is controlled by 
the host driver, and the coprocessor OS has no knowledge of it. Coprocessor OS package idle state control mainly 
consists of the following activities: 


• Prepare the coprocessor OS and the hardware to wake up from the idle state in order to service timer interrupts. 

• Save the core/uncore state and flush the L2 cache, when necessary. 

• On exit from the package idle state, reprogram the LAPIC timers and synchronize timekeeping using an external 
reference clock such as the ETC on the Intel® Xeon Phi™ coprocessor. 

• Exchange messages with the host driver, and update the uOSPMState flag with the package idle state as seen 
from the coprocessor OS. 


3.1.4.4.3.2 PC3 Entry 


This function handles the package C3 idle state entry. As shown in Figure 3-6, this function is called from the core 
idle-state control entry function of the last CPU in the system to go idle. The core idle-selection module selects the package 
idle state in addition to the CPU idle state for the last CPU going idle and calls the core idle-state control entry function. 
The sequence of actions this function executes is: 




1. Start PC3 Entry. 
2. The last CPU going idle sets up the C3WakeupTimer so that the package will exit PC3 in time to service the earliest 
scheduled timer event across all CPUs. 
3. Record current tick count and reference clock (ETC) time. 
4. Set uOSPMState flag to PC3. 
5. Send message to host driver with target state and wake up time. 
6. CPU HALTS. 


There might be conditions under which the time interval to the earliest scheduled timer event for the package is larger 
than what can be programmed into the C3WakeupTimer. In such cases, the coprocessor OS relies on the host driver to 
wake up the package. The package idle-state readiness message that the coprocessor OS sends to the host PM software 
can optionally include a wake-up time, in which case the host driver wakes up the package at the requested time. 
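

The choice between the C3WakeupTimer and a host-driven wake-up can be sketched as follows. The function name, the dictionary keys, and the timer_max_us parameter (the largest interval the timer can hold) are assumptions made for the example; the actual timer range is not given here. 

```python
def plan_pc3_wakeup(now_us, earliest_timer_us, timer_max_us):
    """Decide who wakes the package for the earliest scheduled timer event."""
    interval_us = earliest_timer_us - now_us
    if interval_us <= timer_max_us:
        # The interval fits: program the hardware timer.
        return {"c3_wakeup_timer_us": interval_us, "host_wake_at_us": None}
    # The interval is too long for the timer: put the wake-up time in the
    # readiness message and rely on the host driver instead.
    return {"c3_wakeup_timer_us": None, "host_wake_at_us": earliest_timer_us}
```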


3.1.4.4.3.3 PC3 Exit 


An exit from the package C3 idle state happens either when the C3WakeupTimer expires or when PCI 
Express* traffic arrives. Figure 3-6 illustrates the former. It is important to remember 
that in either case, when the package exits PC3, it triggers the GoalReached interrupt once the core frequency reaches 
the set value. One possible sequence of events in this case is as follows: 


1. The C3WakeupTimer expires and the package exits PC3. 
2. The GoalReached interrupt wakes up BSP. 
3. The BSP processes the PC3 exit. 


Although the package is set up for PC3 and all the CPUs are HALTED, there is no guarantee that the package actually 
transitioned to PC3 idle. So, any CPU that wakes up after PC3 Entry is executed must check that a 
transition to PC3 idle did indeed take place. One way to do this is through the hostPMState flag, which is set by 
the host when it confirms that the package is in PC3 idle. 


The sequence of steps taken by the PC3 Exit routine is as follows: 


1. Start PC3 Exit. 
2. Check the hostPMState flag to confirm transition to PC3. 
3. If the hostPMState flag is not set, then set the uOSPMState flag to PC0. 
4. Send UOS PM PC3 ABORT message to the host driver. 
5. Return. 
6. Read the ETC and calculate package residency in AutoC3. 
7. Update kernel time counters. 
8. Send AutoC3 wakeup IPI to all APs. 
9. Reprogram the Boot Strap Processor (BSP) LAPIC timer for earliest timer event on BSP. 
10. Set the uOSPMState flag to PC0. 
11. Send UOS PM PC3 WAKEUP message to the host driver. 
12. Return. 




The sequence of steps taken by the AC3 wakeup IPI handler (on all Application Processors (APs)) is: 


1. Reprogram LAPIC timer for earliest timer event on CPU 
2. Return 
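

The decision at the heart of the PC3 Exit routine can be sketched as follows. This is a simplified model under stated assumptions: the flags are held in a plain dictionary instead of MMIO registers, the ETC frequency is passed in as etc_hz, and the wakeup IPI and LAPIC reprogramming steps are left out. The function name and its parameters are hypothetical. 

```python
def pc3_exit(flags, etc_now, etc_at_entry, etc_hz, kernel_time_us):
    """Return (updated kernel time in microseconds, messages for the host)."""
    if flags["hostPMState"] != "PC3":
        # The host never confirmed PC3, so the package did not go idle.
        flags["uOSPMState"] = "PC0"
        return kernel_time_us, ["UOS PM PC3 ABORT"]
    # The TSC was frozen, so measure the residency with the external ETC.
    residency_us = (etc_now - etc_at_entry) * 1_000_000 // etc_hz
    kernel_time_us += residency_us
    flags["uOSPMState"] = "PC0"
    return kernel_time_us, ["UOS PM PC3 WAKEUP"]
```

Checking the hostPMState flag first keeps the kernel time counters untouched when the transition to PC3 did not actually take place. 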


Figure 3-6. Package C-state Transitions 


3.1.4.4.3.4 PC6 Entry 


The coprocessor OS runs the PC6 Entry routine either when the coprocessor OS idle-state selection module selects PC6 
as the target package idle state or when the host PM software decides that the package has been in PC3 long enough to 
warrant a deeper idle state like PC6. In the latter case, the host software sends a PC6 Request message to the 
coprocessor OS, which invokes the PC6 Entry routine. Architecturally, the PC6 idle state is similar to the ACPI S3 suspend 
state, wherein the memory is in self-refresh while the rest of the package is powered down. The sequence of actions 
this routine executes consists of: 


1. PC6 Entry (on BSP). 
2. Save CR3. 
3. Switch page tables (to identity map for lower 1 MB memory region). 
4. Send C6 Entry IPI to all APs. 
5. Wait for APs to finish PC6 Entry preparation. 
6. Save uncore context to memory. 
7. Record the ETC value and current tick count. 
8. Save BSP context to memory. 
9. Flush cache. 
10. Set the uOSPMState flag to PC6. 
11. Send PC6 ready message to host. 
12. HALT BootStrap Processor (BSP). 

Or 

13. PC6 Entry (on AP). 
14. Save CR3. 
15. Switch page tables (to identity map for lower 1 MB memory region). 
16. Save AP context to memory. 
17. Set flag to mark PC6 Entry completion. 
18. Flush cache. 
19. HALT AP. 


The PC6 entry implementation takes advantage of the fact that when the PC6 selection is made, it is more than likely 
that most of the cores are already in Core C6, and therefore have already saved the CPU context. If the L2 cache is 
flushed before the last CPU in every core prepares to go to Core C6, then the PC6 Entry algorithm might not need to 
wake up CPUs (from Core C6) only to flush the cache. This reduces the PC6 entry latencies and simplifies the design, but 
the cost of doing an L2 cache flush every time a core is ready for CC6 has to be factored in. 


3.1.4.4.3.5 PC6 Exit 


The host driver PM software is responsible for bringing the package out of a PC6 idle state when the host software 
attempts to communicate with the card. The implicit assumption in any host-initiated package idle-state exit is that after 
the card enters a deep idle state, any further communication with the card has to be mediated through the host PM 
software. Alternatively, the host PM software can bring the card out of a package idle state if the coprocessor OS on the 
card has requested (as part of its idle entry process) that it be awakened after a certain time interval. 


The sequence of actions this routine executes consists of: 


1. PC6 Exit (BSP). 
2. Begin BSP execution from the reset vector because of the VccP transition from 0 to minimum operational 
voltage and the enabling of MCLK. 
3. BootLoader determines that this is a PC6 Exit (as opposed to a cold reset). 
4. BootLoader begins execution of specific PC6 Exit sequence. 
5. Bootstrap passes control to PC6 Exit entry point in GDDR resident coprocessor OS. 
6. BSP restores processor context.
7. BSP restores uncore context.
8. BSP reads the SBox ETC and updates kernel time counters.
9. BSP wakes up APs.
10. BSP sets UOSPMState to PC0.
11. BSP sends coprocessor OS Ready message to host driver.


Or


12. PC6 Exit (AP).
13. AP begins execution of trampoline code and switches to 64-bit mode.
14. AP restores processor state.
15. AP signals PC6 Exit complete to BSP.


Figure 3-7 Package C6 Entry and Exit Flow


3.1.4.4.3.6 Bootloader Role in Idle State Control 


The bootloader program coordinates the exit from PC6 and facilitates the waking up of cores from CC6. The
Bootloader interfaces with both the coprocessor OS and the host Intel® MPSS driver to enable these transitions. The
main interfaces are:


• Interface to reserve memory in the first megabyte of GDDR to install Core C6 wake-up code.
• Interface with the host Intel® MPSS driver to obtain the PC6 entry point into the coprocessor OS kernel.
• Interface with the host Intel® MPSS driver to detect a PC6 exit as opposed to a cold reset.


One Intel® Xeon Phi™ coprocessor implementation option is for the host Intel® MPSS driver to send the PC6 exit entry
point as part of a BootParam structure that is located in a region of GDDR memory at a well-known address shared by the
host Intel® MPSS driver and the Bootloader.


The hostPMState MMIO register could be used by the Bootloader to distinguish a PC6 exit from cold reset. 
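
A minimal sketch of how the Bootloader might choose between the cold-reset path and the PC6 exit path is shown below. The register encodings (HOST_PM_STATE_COLD, HOST_PM_STATE_PC6), the layout of boot_param_t, and the function select_boot_path are assumptions made for illustration; the text does not define the actual hostPMState values or the BootParam layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hostPMState encodings; the real values are not given in the text. */
enum { HOST_PM_STATE_COLD = 0, HOST_PM_STATE_PC6 = 6 };

/* Hypothetical layout of the BootParam structure placed in GDDR by the host driver. */
typedef struct {
    uint64_t pc6_exit_entry;  /* PC6 exit entry point in the coprocessor OS */
} boot_param_t;

typedef enum { BOOT_COLD_RESET, BOOT_PC6_EXIT } boot_path_t;

/* Take the abbreviated PC6 exit path only when the host marked the state as PC6
 * and supplied a non-zero entry point; otherwise fall back to a cold boot. */
boot_path_t select_boot_path(uint32_t host_pm_state, const boot_param_t *bp)
{
    if (host_pm_state == HOST_PM_STATE_PC6 && bp != NULL && bp->pc6_exit_entry != 0)
        return BOOT_PC6_EXIT;
    return BOOT_COLD_RESET;
}
```

Falling back to the cold-reset path whenever the entry point is missing is a conservative design choice in this sketch: a full boot is slower but never jumps to an invalid address.
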


Every Intel® Xeon Phi™ coprocessor core has a block of registers that is initialized by the Bootloader, and then locked
against subsequent write access for security reasons. However, since these register contents are lost during CC6, the
Intel® Xeon Phi™ coprocessor reserves a block of SBox MMIO registers that are used to maintain a copy of these secure
register contents. It is the Bootloader's responsibility to initialize this block with the contents of the control registers
during the boot-up process. Subsequently, when a core wakes up from CC6, the ucode copies the contents of the SBox
register block back into the core registers.
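
The shadow-copy scheme described above can be modeled in a few lines. This is only an illustrative sketch: SECURE_REG_COUNT, both structure types, and all function names are hypothetical, and real hardware performs the restore in microcode rather than in C.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECURE_REG_COUNT 4  /* hypothetical size of the locked register block */

typedef struct {
    uint64_t regs[SECURE_REG_COUNT];
    bool     locked;                    /* further writes are rejected once set */
} core_secure_regs_t;

typedef struct {
    uint64_t shadow[SECURE_REG_COUNT];  /* copy kept in SBox MMIO space */
} sbox_shadow_t;

/* Bootloader: program the core registers, mirror them into the SBox block, then lock. */
bool bootloader_init_secure_regs(core_secure_regs_t *core, sbox_shadow_t *sbox,
                                 const uint64_t values[SECURE_REG_COUNT])
{
    if (core->locked)
        return false;
    memcpy(core->regs, values, sizeof core->regs);
    memcpy(sbox->shadow, values, sizeof sbox->shadow);
    core->locked = true;
    return true;
}

/* Models CC6: the register contents are lost while the core is powered down. */
void core_enter_cc6(core_secure_regs_t *core)
{
    memset(core->regs, 0, sizeof core->regs);
}

/* Models the restore performed on CC6 wake-up from the SBox shadow block. */
void core_wake_from_cc6(core_secure_regs_t *core, const sbox_shadow_t *sbox)
{
    memcpy(core->regs, sbox->shadow, sizeof core->regs);
}
```
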


3.1.5 PM Software Event Handling Function 


One of the key roles for the Intel® MIC Architecture PM software is the handling of power and thermal events and
conditions that occur during the operation of the Intel® Xeon Phi™ coprocessor. These events and conditions are
handled primarily by the coprocessor OS PM Event Handler module. The number and priority of these events are 
hardware dependent and implementation specific. However, these events fall into two basic categories: proactive and 
reactive. 


For example, the Intel® Xeon Phi™ coprocessor has the ability to notify the coprocessor OS when the die temperature
exceeds programmed thresholds, which allows the software to act proactively. On the other hand, the coprocessor OS
software acts reactively when an OverThermal condition occurs in the die: the hardware automatically throttles the core
frequency to a predetermined lower value and interrupts the CPU.


Table 3-3 lists the events and conditions that the coprocessor OS should handle for the Intel® Xeon Phi™ coprocessor,
their source, indications, and suggested software response.


Table 3-3. Events and Conditions Handled by the Coprocessor OS


| Event or Condition | Source | Indication | Suggested Coprocessor OS Action | Remarks |
|---|---|---|---|---|
| CPUHOT | Raised either by the sensors in the die, the VR, or the SMC | TMU interrupt and MMIO status register | Hardware automatically throttles core frequency to a low value. Coprocessor OS resets its P-state evaluation algorithm, programs frequency and voltage to correspond to configurable values and enables the GoalReached interrupt. | When the hardware exits the CPUHOT condition, it locks on to the frequency programmed by the coprocessor OS, and raises the GoalReached interrupt. Coprocessor OS restarts the P-state evaluation algorithm. |
| SW Thermal threshold 1 crossed on the way up | TMU | TMU interrupt and MMIO status register | Coprocessor OS sets max P-state to P1. The new max P-state takes effect during the next P-state selection pass. | |
| SW Thermal threshold 2 crossed on the way up | TMU | TMU interrupt and MMIO status register | Coprocessor OS sets max P-state to a configurable value between P1 and Pn. Affects P-state change immediately. | |
| SW Thermal threshold 1 crossed on the way down | TMU | TMU interrupt and MMIO status register | Coprocessor OS sets max P-state to P0 (turbo). The new max P-state takes effect during the next P-state selection pass. | |
| PWRLIMIT | SMC | I2C interrupt | Coprocessor OS reads SMC power limit value and sets low and high water mark thresholds for power limit alerting. | SMC will interrupt the coprocessor OS when it has a new power limit setting from the platform. |
| PWRALERT | SMC | TMU interrupt, MMIO status register | Raised when the card power consumption crosses either the low or the high threshold set by the coprocessor OS. The coprocessor OS adjusts P-state accordingly. | |
| Over current limit | SVID | | Coprocessor OS P-state evaluation algorithm reads SVID current output and compares it to preset limits for modifying the P-state. | |
| Fan speed | SMC | MMIO register | Coprocessor OS P-state evaluation algorithm reads fan speed and compares it to preset limits for modifying the P-state. | |
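
The software thermal threshold rows of Table 3-3 can be sketched as a small event handler. The names pm_event_t, pstate_policy_t, and handle_thermal_event are hypothetical, as is the convention that P-states are integers with 0 standing for P0 (turbo); the sketch only captures which maximum P-state is selected and whether the change waits for the next P-state selection pass.

```c
#include <stdbool.h>

typedef enum {
    EVT_THERMAL1_UP,    /* SW thermal threshold 1 crossed on the way up */
    EVT_THERMAL2_UP,    /* SW thermal threshold 2 crossed on the way up */
    EVT_THERMAL1_DOWN   /* SW thermal threshold 1 crossed on the way down */
} pm_event_t;

typedef struct {
    int  max_pstate;       /* 0 = P0 (turbo); larger numbers are slower states */
    int  throttle_pstate;  /* configurable value between P1 and Pn */
    bool apply_now;        /* true if the change must not wait for the next pass */
} pstate_policy_t;

/* Update the maximum allowed P-state according to the thermal event. */
void handle_thermal_event(pstate_policy_t *p, pm_event_t evt)
{
    switch (evt) {
    case EVT_THERMAL1_UP:
        p->max_pstate = 1;                  /* cap at P1, effective on the next pass */
        p->apply_now = false;
        break;
    case EVT_THERMAL2_UP:
        p->max_pstate = p->throttle_pstate; /* deeper cap, effective immediately */
        p->apply_now = true;
        break;
    case EVT_THERMAL1_DOWN:
        p->max_pstate = 0;                  /* allow P0 (turbo) again on the next pass */
        p->apply_now = false;
        break;
    }
}
```
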


3.1.6 Power Management in the Intel® MPSS Host Driver


The host driver power management (PM) component is responsible for performing PM activities in cooperation with the
coprocessor OS on an Intel® Xeon Phi™ coprocessor. These activities are performed after receiving events or
notifications from the control panel, the coprocessor OS, or the host operating system. The PM component in the host
driver and the PM component in the coprocessor OS communicate using SCIF.


Power management in the host driver falls into four functional categories:


• Control panel (Ring3 module) interface
• Host OS power management
• Host-to-coprocessor OS communication and commands
• Package states handling


3.1.6.1 PM Interface to the Control Panel 


The host driver implements services to collect user inputs. It provides an interface (e.g., sysfs on Linux*) through which
the control panel reads PM status variables such as core frequency, VID, number of idle CPUs, and power consumption.
Other PM tools and monitoring applications can also use the interface to set or get PM variables.


3.1.6.2 Host OS Power Management 


Power management works on two levels. It can be applied to the system as a whole or to individual devices. The 
operating system provides a power management interface to drivers in the form of entry points, support routines, and 
I/O requests. The Intel® MPSS host drivers conform to operating system requirements and cooperate to manage power
for their devices. This allows the operating system to manage power events on a system-wide basis. For example, when the OS
sets the system to state S3, it relies upon the Intel® MPSS host driver to put the device in the corresponding device
power state (D-state) and to return to the working state in a predictable fashion. Even if the Intel® MPSS host driver can
manage the Intel® Xeon Phi™ coprocessor's sleep and wake cycles, it uses the operating system's power management
capabilities to put the system as a whole into a sleep state.


The Intel® MPSS host driver interfaces with the host operating system for power management by doing the following:


• Reporting device power capabilities during PnP enumeration.
• Handling power I/O requests sent by the host OS or by another driver in the device stack (applicable to the Windows
environment).
• Powering up the Intel® Xeon Phi™ coprocessor(s) as soon as they are needed after system startup or idle shutdown.
• Powering down the Intel® Xeon Phi™ coprocessor at system shutdown or putting the system to sleep when idle.


Most of the power management operations are associated with installing and removing Intel® Xeon Phi™ coprocessors.
Hence, the Intel® MPSS host driver supports Plug and Play (PnP) to get power-management notifications.


3.1.6.2.1 Power Policies (applicable to Windows) 


You can use the Windows control panel to set system power options. The Intel® MPSS host driver registers a callback
routine with the operating system to receive notification. As soon as the driver registers the callback during load,
the OS immediately calls the callback routine and passes the current value of the power policy. Later, the OS uses this
callback to notify the host driver of changes to the active power policy. The driver then forwards the
policy change request and associated power settings to the coprocessor OS.


3.1.6.3 PM Communication with the Coprocessor OS


A set of commands specifically for power management facilitates communication between the host driver and the
coprocessor OS. These commands initiate specific PM functions or tasks, and coordinate the exchange of PM
information.


The Intel® MPSS host driver uses the symmetric communication interface (SCIF) layer to create a channel to send
messages to the coprocessor OS PM component. SCIF provides networking and communication capabilities within a
single platform. In the SCIF context, the host driver and the coprocessor OS PM components are on different SCIF
nodes. The Intel® MPSS host driver creates a Ring0-to-Ring0 communication queue from its own node to a "known" SCIF
port (logical destination) on the coprocessor OS node. The message types are summarized in Table 3-4.


Table 3-4. Power Management Messages 


Message Type                  Description

Status queries                Messages passed to inquire about the current PM status; for example, core
                              voltage, frequency, power budget, etc. Most of this data is supplied to the
                              control panel.

Policy control                Messages that control PM policies in the coprocessor OS; for example,
                              enable/disable turbo, enable/disable idle package states, etc.

Package state commands        Messages used to monitor and handle package states; for example, get/set VccP,
                              get entry/exit latencies, etc.

Notifications from the        The coprocessor OS notifies the host driver when it is going to enter an idle
coprocessor OS                state because all the cores are idle.


3.1.6.4 Package States (PC States) Handling 


One of the main PM responsibilities of the Intel® MPSS host driver is to monitor idle states. The host driver monitors the
amount of time that the coprocessor OS spends idle and makes decisions based on the timer's expiration. When all the
CPUs in the Intel® Xeon Phi™ coprocessors are in the core C1 state, the coprocessor OS notifies the host driver that the
devices are ready to enter package sleep states. At this stage, the coprocessor OS goes to the auto PC3 state. The
coprocessor OS, on its own, cannot select the deeper idle states (deep PC3 and PC6). It is the responsibility of the host 
driver to request that the coprocessor OS enter a deeper idle state when it believes that the coprocessor OS has spent 
enough idle time in the current idle state (PC6 is the deepest possible idle state). 


3.1.6.4.1 Power Control State Entry and Exit Sequences 


This section summarizes the steps followed when the package enters the PC3 or the PC6 idle state. 


PC3 auto Entry:


1. Receive idle state notification for auto PC3 entry from coprocessor OS.
2. Wait for Intel® Xeon Phi™ coprocessor Idle/Resume flag = PC3 code.
3. Verify hardware idle status.
4. Set HOST Idle/Resume flag = auto PC3 code.
5. Start host driver timer for auto PC3 state.


PC3 deep Entry:


1. Make sure that the host driver auto PC3 timer has expired.
2. Verify hardware idle status.
3. Set VccP to the minimum retention voltage value.
4. Set HOST Idle/Resume flag = deep PC3 code.
5. Start the host driver timer for PC6 state.


PC6 Entry:


1. Make sure that the host driver PC6 timer has expired.
2. Execute the PC3 deep Exit algorithm.
3. Request that the coprocessor OS enter PC6 state.
4. Receive readiness notification for PC6 entry from the coprocessor OS.
5. Wait for Intel® Xeon Phi™ coprocessor Idle/Resume flag = PC6 code.
6. Verify hardware idle status.
7. Set VccP to zero (0) volts.
8. Set HOST Idle/Resume flag = PC6 code.


PC3 deep Exit:


1. Set VccP to the minimum operating voltage.
2. Wait for Intel® Xeon Phi™ coprocessor Idle/Resume flag = C0 code.
3. Set HOST Idle/Resume flag = C0 code.


PC6 Exit:


1. Set VccP to the minimum operating voltage.
2. Wait for Intel® Xeon Phi™ coprocessor Idle/Resume flag = C0 code.
3. Set HOST Idle/Resume flag = C0 code.
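
Taken together, the entry and exit sequences above behave like a small state machine in the host driver. The following sketch models only the state and VccP transitions; the types host_pm_t, pkg_state_t, and vccp_level_t and the three handler functions are hypothetical, and the flag handshakes with the coprocessor OS are folded into the hw_idle argument.

```c
#include <stdbool.h>

typedef enum { PKG_PC0, PKG_AUTO_PC3, PKG_DEEP_PC3, PKG_PC6 } pkg_state_t;
typedef enum { VCCP_OPERATING, VCCP_RETENTION, VCCP_OFF } vccp_level_t;

typedef struct {
    pkg_state_t  state;  /* package state as tracked by the host driver */
    vccp_level_t vccp;   /* VccP level last programmed by the host driver */
} host_pm_t;

/* The coprocessor OS reported that all cores are idle (auto PC3 entry). */
void on_idle_notification(host_pm_t *pm, bool hw_idle)
{
    if (pm->state == PKG_PC0 && hw_idle)
        pm->state = PKG_AUTO_PC3;      /* the auto PC3 timer would start here */
}

/* The host driver idle timer for the current state expired. */
void on_timer_expired(host_pm_t *pm, bool hw_idle)
{
    if (!hw_idle)
        return;                        /* hardware is not idle: stay in the current state */
    if (pm->state == PKG_AUTO_PC3) {
        pm->vccp = VCCP_RETENTION;     /* deep PC3: minimum retention voltage */
        pm->state = PKG_DEEP_PC3;      /* the PC6 timer would start here */
    } else if (pm->state == PKG_DEEP_PC3) {
        pm->vccp = VCCP_OFF;           /* PC6: VccP at 0 V once the card reports readiness */
        pm->state = PKG_PC6;
    }
}

/* Host software needs the card: restore the operating voltage and return to PC0. */
void on_wake_request(host_pm_t *pm)
{
    if (pm->state == PKG_DEEP_PC3 || pm->state == PKG_PC6)
        pm->vccp = VCCP_OPERATING;
    pm->state = PKG_PC0;
}
```
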


3.1.6.4.2 Package State Handling and SCIF 


SCIF is the interface used for communication between the host software and the coprocessor OS software running on 
one or more Intel® Xeon Phi™ coprocessors. SCIF is also used for peer-to-peer communication between Intel® Xeon Phi™
coprocessors. This interface could potentially (for speed and efficiency reasons) be based on a distributed shared
memory architecture where peer entities on the host and the Intel® Xeon Phi™ coprocessor share messages by directly
writing to each other's local memory (Remote Memory Access). The host driver takes into account the SCIF 
communication channels that are open on an Intel® Xeon Phi™ coprocessor when deciding to put it into a deeper 
package idle state. 


3.1.6.4.3 Boot Loader to Host Driver Power Management Interface 


The boot loader executes when power is first applied to the device, but it also runs on exit from the PC6 idle state,
because the VccP power rail is removed in PC6. The boot-loader component for Intel® Xeon Phi™ coprocessors has a PM-aware
abbreviated execution path designed specifically for exiting D3 and PC6 states, minimizing the time required to
return the Intel® Xeon Phi™ coprocessor to full operation from D3 and PC6. To support PC6 exit, the host driver interacts
with the boot loader via the scratchpad registers.


Figure 3-8 Intel® MPSS Host Driver to Coprocessor OS Package State Interactions


3.2 Virtualization


A platform that supports virtualization typically has a Virtual Machine Manager (VMM) that hosts multiple Virtual 
Machines. Each virtual machine runs an OS (Guest OS) and application software. Different models exist for supporting 
I/O devices in virtualized environments, and the Intel® Xeon Phi™ coprocessor supports the direct assignment model
wherein the VMM directly assigns the Intel® Xeon Phi™ coprocessor device to a particular VM and the driver within the
VM has full control with minimal intervention from the VMM. The coprocessor OS does not require any modifications to
support this model; however, the chipset and VMM are required to support the following Intel® VT-d (Intel® Virtualization
Technology for Directed I/O) features:


• Hardware-assisted DMA remapping
• Hardware-assisted interrupt remapping
• Shared device virtualization


3.2.1 Hardware Assisted DMA Remapping 


In virtualized environments, guests have their own view of physical memory (guest physical addresses) that is distinct 
from the host's physical view of memory. The guest OS Intel® Xeon Phi™ coprocessor device driver (and thus the
coprocessor OS on the Intel® Xeon Phi™ coprocessor dedicated to the guest) only knows about guest physical addresses
that must be translated to host physical addresses before any system memory access. Intel® VT-d (implemented in the
chipset) supports this translation for transactions that are initiated by an I/O device in a manner that is transparent to
the I/O device (i.e., the Intel® Xeon Phi™ coprocessor). It is the VMM's responsibility to configure the VT-d hardware in
the chipset with the mappings from guest physical to host physical addresses when creating the VM. For details, refer to
the Intel® VT for Directed I/O Specification (Intel® Virtualization Technology for Directed I/O, 2011).
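
The effect of DMA remapping can be illustrated with a deliberately simplified translation routine. Actual VT-d hardware walks multi-level page tables configured by the VMM; the flat table, the 4 KB page size, and the names remap_entry_t and remap_gpa_to_hpa are assumptions made only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                                   /* 4 KB pages assumed */
#define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

typedef struct {
    uint64_t guest_pfn;  /* guest physical page frame number */
    uint64_t host_pfn;   /* host physical page frame number */
} remap_entry_t;

/* Translate a guest physical address to a host physical address.
 * Returns false when no mapping exists for the page. */
bool remap_gpa_to_hpa(const remap_entry_t *table, size_t n, uint64_t gpa, uint64_t *hpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;
    for (size_t i = 0; i < n; i++) {
        if (table[i].guest_pfn == pfn) {
            /* keep the offset within the page, swap the frame number */
            *hpa = (table[i].host_pfn << PAGE_SHIFT) | (gpa & PAGE_MASK);
            return true;
        }
    }
    return false;
}
```

The device (here, the coprocessor) only ever issues the guest physical address; the lookup happens in the chipset, which is why the coprocessor OS needs no modification under the direct assignment model.
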


3.2.2 Hardware Assisted Interrupt Remapping 


In a virtualized environment with direct access, it is the guest and not the host VMM that should handle an interrupt
from an I/O device. Without hardware support, interrupts would have to be routed first to the host VMM, which would then
inject the interrupt into the guest OS. Intel® VT-d provides interrupt remapping support in the chipset, which the VMM
can use to route interrupts (either I/O APIC generated or MSIs) from specific devices to guest VMs. For details, refer to
the Intel® VT for Directed I/O specification.


3.2.3 Shared Device Virtualization 


Each card in the system can be either dedicated to a guest OS or shared among multiple guest operating systems. The
shared option requires the highest level of support in the coprocessor OS, because the coprocessor OS must then service
multiple host operating systems simultaneously.


3.3 Reliability Availability Serviceability (RAS) 


RAS stands for reliability, availability, and serviceability. Specifically, reliability is defined as the ability of the system to 
perform its actions correctly. Availability is the ability of the system to perform useful work. Serviceability is the ability of 
the system to be repaired when failures occur. Given that HPC computing tasks may require large amounts of resources 
both in processing power (count of processing entities or nodes) and in processing time, node reliability becomes a 
limiting factor if not addressed by RAS strategies and policies. This section covers RAS strategies available in software on
the Intel® Xeon Phi™ coprocessor and its host-side server.


In HPC compute clusters, reliability and availability are traditionally handled in a two-pronged approach: by deploying
hardware with advanced RAS features to reduce error rates (as exemplified in the Intel® Xeon® processors) and by
adopting fault tolerance in high-end system software or hardware. Common software-based methods of fault tolerance
are to deploy redundant cluster nodes or to implement snapshot and restore (check pointing) mechanisms that allow a
cluster manager to reduce data loss when a compute node fails by setting it to the state of the last successful snapshot.
Fault tolerance, in this context, is about resuming from a failure with as much of the machine state intact as possible. It 
does not imply that a cluster or individual compute nodes can absorb or handle failures without interrupting the task at 
hand. 


The Intel® Xeon Phi™ coprocessor addresses reliability and availability in the same two ways. Hardware features have
been added that improve reliability; for example, ECC on GDDR and internal memory arrays that reduce error rates.
Fault tolerance on Intel® Xeon Phi™ coprocessor hardware improves failure detection (extended machine check
architecture, or MCA). Managed properly, the result is a controlled and limited degradation allowing a node to stay in
service after certain anticipated hardware failure modes manifest themselves. Fault tolerance in Intel® Xeon Phi™
coprocessor software is assisted by the Linux* coprocessor OS, which supports application-level snapshot and restore
features that are based on BLCR (Berkeley Lab Checkpoint/Restart).


The Intel® Xeon Phi™ coprocessor approach to serviceability is through software redundancy (that is, node management
removes failing compute nodes from the cluster); it has no true hardware redundancy. Instead, software and firmware
features allow a compute node to reenter operation at reduced capacity after failures until the card can be replaced.
The rationale behind this 'graceful' degradation strategy is the assumption that an Intel® Xeon Phi™ coprocessor unit
with, say, one less core will be able to resume application snapshots and therefore is a better proposition to the cluster
than removing the node entirely.


A hardware failure requires the failing card to be temporarily removed from the compute cluster it is participating in. 
After a reboot, the card may rejoin the cluster if cluster management policies allow for it. 


The Intel® Xeon Phi™ coprocessor implements extended machine check architecture (MCA) features that allow software
to detect and act on detected hardware failures in a manner allowing a 'graceful' degradation of service when certain
components fail. Intel® Xeon Phi™ coprocessor hardware reads bits from programmable FLASH at boot time, which may
disable processor cores, cache lines, and tag directories that the MCA has reported as failing.


3.3.1 Check Pointing 


In the context of RAS, check pointing is a mechanism to add fault tolerance to a system by saving its state at certain 
intervals during execution of a task. If a non-recoverable error occurs on that system, the task can be resumed from the 
last saved checkpoint, thereby reducing the loss caused by the failure to the work done since the last checkpoint. In HPC, 
the system is the entire cluster, which is defined as all the compute nodes participating in a given HPC application. 
Cluster management controls where and when checkpoints occur and locks down its compute nodes prior to the 
checkpoint. The usual mode of operation is for checkpoints to occur at regular intervals or if system monitoring 
determines that reinstating a checkpoint is the proper course of action. Individual compute nodes are responsible for 
handling local checkpoint and restore (C/R) events, which have to be coordinated in order to establish a cluster-wide 
coherent C/R. Conceptually check pointing can be handled in two ways: 


• A checkpoint contains the state of the entire compute node, which includes all applications running on it (similar
to hibernate); this is referred to as a system checkpoint.
• Or a checkpoint contains the state of a single program running on the compute node; this is referred to as an
application checkpoint.


Application check pointing is by far the most widespread method; it is simpler to implement, produces smaller snapshot 
images, and may have uses beyond fault tolerance, such as task migration (create snapshot of one system, terminate the 
application, and restart it on another system) and gang scheduling. These alternate uses are limited to cluster nodes 
running the same OS and running on similar hardware. System checkpoints are, for all practical purposes, locked to the
system they were taken on.


The remainder of this section addresses the basics of BLCR and its integration into the Intel® Xeon Phi™ coprocessor.
BLCR details are available at the following links:


• http://crd.lbl.gov/~jcduell/papers/blcr.pdf
• https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#batch
• https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Admin_Guide.html
• https://upc-bugs.lbl.gov//blcr/doc/html/BLCR_Users_Guide.html


3.3.2 Berkeley Lab Checkpoint/Restart (BLCR)


Due to the altered ABI required for the Linux* coprocessor OS, BLCR is recompiled specifically for the Intel® Xeon Phi™
coprocessor, but otherwise no changes are required for BLCR except for the kernel module. The kernel module
incorporates additional process states provided by Intel® Xeon Phi™ coprocessor hardware (the vector registers).


Beyond the enhanced register set, the BLCR kernel module is not different. A patch set for BLCR version 0.8.2 (the 
latest) exists for the Linux* kernel 2.6.34 and has been shown to build correctly on a standard Linux* system. 


BLCR software is, by design, limited to creating a checkpoint for a process (or process group) running under a single 
operating system. In larger clusters, where the compute workload is spread over several cooperating systems, a 
checkpoint of a single process does not result in any fault tolerance because the state of that process would soon be out 
of synchronization with the rest of the cluster (due to interprocess messaging). Therefore, a checkpoint within a cluster
must be coordinated carefully; e.g., by creating checkpoints of all participants in a compute task simultaneously during a
lock-down of interprocess communications. Cluster management software must support C/R and implement a method
either for putting all participants into a quiescent state during the checkpoint (and to restore all if a participant fails to 
create one) or for providing a protocol to put each node into a restorable state before the checkpoint occurs. 


MPI stacks supporting BLCR have built-in protocols to shut down the IPC between compute nodes and to request a 
checkpoint to be created on all participants of a ‘job’. 


Locally, BLCR offers either a cooperative approach or a non-cooperative approach for very simple applications. With the 
cooperative approach, the application is notified before and after a checkpoint is created. The cooperative approach is 
intended to give checkpoint-aware applications a way to save the state of features known not to be preserved across a 
C/R event. The design of BLCR deliberately leaves out the handling of process states that cannot be implemented well 
(to avoid instability), such as TCP/IP sockets, System-V IPC, and asynchronous I/O. If any of these features are used by 
the application, they must be brought into a state that allows the application to recreate them after a restore event. 


BLCR relies on kernel-assisted (kernel module required) methods to retrieve a useful process state. A BLCR library must 
be linked to the application in order to establish communication between the application and the kernel module, and to 
run a private thread within the application that handles call-outs before and after C/R events. 


An application process gets notification from the BLCR kernel module through a real-time signal so that it can protect its
critical regions by registering callbacks to clean house before the checkpoint data is written to file. Upon restart, the
same callbacks allow the process to restore internal settings before resuming operations. 


The result of a BLCR checkpoint is an image file containing all process state information necessary to restart it. A 
checkpoint image can be quite large, potentially as large as the node's available memory (swap plus RAM). The Intel® 
Xeon Phi™ coprocessor does not have local persistent storage to hold checkpoint images, which means they must be 
shipped to the host (or possibly beyond) over a networked file system to a disk device. 


Analysis of BLCR implementations shows that I/O to the disk device is the most time-consuming part of check pointing.
Assuming the checkpoint images go to the local host's file system, the choice of file system and disk subsystem on the
host become the key factors in checkpoint performance. Alternatives to spinning disks must be considered carefully,
though this choice does not affect the C/R capability and is outside the scope of BLCR.


The BLCR package provides three application programs and a library (plus includes) for building check pointing 
applications. The BLCR library contains local threads that allow the application some control over when a checkpoint can 
take place. A simple API lets parts of the application prepare for a checkpoint independently. The mechanism is to 
register functions like the following with the BLCR library during process initialization: 


    #include <libcr.h>

    int my_callback(void *data_ptr)
    {
        struct my_data *pdata = (struct my_data *)data_ptr;
        int did_restart;

        (void)pdata;  // application state used by the shutdown logic
        // do checkpoint-time shutdown logic
        // tell system to do the checkpoint
        did_restart = cr_checkpoint(0);
        if (did_restart > 0) {
            // we've been restarted from a checkpoint
        } else {
            // we're continuing after being backed up
        }
        return 0;
    }


The local BLCR thread calls all registered callbacks before the kernel module checkpoints the application. Once all
callbacks have called cr_checkpoint(), the local BLCR thread signals the kernel module to proceed
with the checkpoint. After the checkpoint, cr_checkpoint() returns to the callback routines with information on
whether a restart or checkpoint took place.


3.3.2.1 BLCR and SCIF


SCIF is a new feature in the Linux* based coprocessor OS and so has no support in the current BLCR implementation. 
SCIF has many features in common with sockets. Therefore, BLCR handling of open SCIF connections is treated the same 
way as open sockets; that is, not preserved across C/R events. 


The problem area for sockets is that part of the socket state might come from data present only in the kernel's network 
stack at the time of checkpoint. It is not feasible for the BLCR kernel module to retrieve this data and stuff it back during 
a later restore. 


The problems for SCIF are the distribution of data in the queue pair and the heavy use of references to physical 
addresses in the PCI-Express* domain. It is not feasible to rely on physical locations of queue pairs being consistent 
across a Linux* coprocessor OS reboot, and SCIF is not designed to be informed of the location of queue pairs. 


3.3.2.2 Miscellaneous Options 


Some aspects of BLCR on the Intel® Xeon Phi™ coprocessor are linked to the applied usage model. In the Intel® MIC
Architecture coprocessing mode, this requires a decision as to what a checkpoint covers. In this mode, only the host
participates (by definition) as a node in a compute cluster. If it is compatible with compute clusters and C/R is used
within the cluster, then only the host can be asked to create a checkpoint. The host must act as a proxy and delegate
BLCR checkpoints to the Intel® Xeon Phi™ coprocessor cards as appropriate and manage the checkpoint images from
Intel® Xeon Phi™ coprocessors in parallel with its own checkpoint file.


Another, less complicated approach is for the host to terminate tasks on all Intel® Xeon Phi™ coprocessors before
creating a check point on itself. The tradeoff is between complexity and the compute time to be redone, depending on the
average task length, as part of resuming from check pointing.


Intel® Xeon Phi™ coprocessors used in an Intel® Xeon® offload or autonomous mode do not face this problem because
each card is known to the cluster manager that dispatches check point requests to cards individually. The host is a
shared resource to the Intel® Xeon Phi™ coprocessors and is not likely to be part of the check pointing mechanism.


Check pointing speed has been identified as a potential problem, mostly because the kernel module that performs the 
bulk of the state dump is single threaded. Work has been done in the MPI community to speed this up, but the 
bottleneck appears to be the disk driver and disk I/O, not the single threading itself. Several references point to PCI- 
Express*-based battery backed memory cards or to PCI-Express*-based Solid State Drive (SSD) disks as a faster medium 
for storing checkpoint images. It is trivial to make the host use these devices to back networked file systems used by
the Linux* coprocessor OS, but access still has to go through the host. It may be more effective to let the Intel® Xeon
Phi™ coprocessors access these devices directly over PCI-Express*, but that approach requires that the device be
independently accessible from multiple peer Intel® Xeon Phi™ coprocessors and that device space be divided
persistently between Intel® Xeon Phi™ coprocessors such that each has its own fast-access file system dedicated to
checkpoint images. 


3.3.3 Machine Check Architecture (MCA) 


Machine Check Architecture is a hardware feature enabling an Intel® Xeon Phi™ coprocessor card to report failures to
software by means of interrupts or exceptions. Failures in this context are conditions where logic circuits have detected
something out of order, which may have corrupted processor context or memory content. Failures are categorized by
severity as either DUEs or CEs:
▪ DUEs (Detected Unrecoverable Errors) are errors captured by the MC logic for which the corruption cannot be
repaired and the system as a whole is compromised; for example, errors in L1 cache.
▪ CEs (Corrected Errors) are errors that have occurred and been corrected by the hardware, such as single bit
errors in L2 ECC memory.


3.3.3.1 MCA Hardware Design Overview 


Standard IA systems implement MCA by providing two mechanisms to report MC events to software: MC exceptions 
(#18) for events detected in the CPU core and NMI (#2) interrupts for events detected outside of the CPU core (uncore). 


Details of the MC exceptions that have occurred are presented in MSR banks, each representing up to 32 events. The
processor capability MSRs specify how many banks are supported by a given processor. The interpretation of data in MSR
banks is semi-standardized; that is, acquiring detailed raw data on an event is standardized but the interpretation of
acquired raw data is not. The Intel® Xeon Phi™ coprocessor provides three MC MSR banks.


MC events signaled through the NMI interrupt on standard IA systems come from the chipsets and represent failures in 
memory or I/O paths. Newer CPUs with built-in memory controllers also provide a separate interrupt for CEs (CMCIs) 
that have built-in counter dividers to throttle interrupt rates. This capability is not provided on the Intel® Xeon Phi™
coprocessor. Instead, the Intel® Xeon Phi™ coprocessor delivers both uncorrected and corrected errors that are
detected in the core domain via the standard MCA interrupt (#18). Machine check events that occur in the uncore
domain are delivered via the SBox, which can be programmed to generate an NMI interrupt targeted at one or all
threads. The Uncore Interrupt includes MC events related to the PCI-Express interface, Memory Controller (ECC and
link training errors), or other uncore units. There is no CE error rate throttle in the Intel® Xeon Phi™ coprocessor. The
only remedy against high error frequencies is to disable the interrupt at the source of the initiating unit (L2/L1 Cache, 
Tag Directory, or GBox). 


The NMI interrupt handler software must handle a diverse range of error types on the Intel® Xeon Phi™ coprocessor.
Registers to control and report uncore MC events on the Intel® Xeon Phi™ coprocessor differ significantly from registers
on standard IA chipsets, which means that stock operating systems have no support for uncore MC events on an Intel®
Xeon Phi™ coprocessor.


3.3.3.2 MCA Software Design Overview


Intel® Xeon Phi™ coprocessor RAS demands that the software perform MC event handling in two stages: event data
gathering and event post-processing.


The first stage (which takes place in the Linux* coprocessor OS) receives MC event notifications, collects raw data, and 
dispatches it to interested parties (i.e., an MCA agent running on the host and the on-card SMC controller). If the 
coprocessor OS can resume operation, then its event handling is completed. Otherwise, the MC event handler notifies 
the host separately that its internal state has been corrupted and a reboot is required. 


An unrelated service for host-side monitoring of the Intel® Xeon Phi™ coprocessor card state will be added to the MCA
handling routines. This service will act as a gateway between host side 'in-band' platform management and the SMC
sub-system and respond to system state queries, such as memory statistics, free memory, temperatures, and CPU states.
Host queries of the coprocessor OS MCA log are a part of the service too.


3.3.3.3 MC Event Capture in Linux* 2.6.34 


The stock Linux* kernel has support for core MCs in a single dedicated exception handler. The handler expects MCA 
exceptions to be broadcast to all processors in the system, and it will wait for all CPUs to line up at a rendezvous point 
before every CPU inspects its own MCA banks and stores flagged events in a global MC event log (consisting of 32 
entries). Then the handler on all CPUs lines up at a rendezvous point again and one CPU (the monarch, which is selected 
as the first entering the MCA event handler) gets to grade the MCA events collected in the global MC event log and to 
determine whether to panic or resume operation. This takes place in function monarch_reign(). If resumed, the MCA
handler may send BUS-ERROR signals to the processes affected by the error. Linux* has several kernel variables that 
control sensitivity to MCA exceptions, ranging from always panic to always ignore them. 


Linux* expects MC events to be broadcast to all CPUs. The rendezvous point uses CPU count versus event handler 
entries as wait criteria. The wait loop is implemented as a spinlock with timeout, such that a defunct CPU cannot prevent 
the handler from completing. 
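

The rendezvous-with-timeout idea described above can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the class `Rendezvous`, the helper `simulate`, and the timeout values are hypothetical names chosen for illustration, threads stand in for CPUs, and the loop sleeps between polls whereas the kernel handler spins. It is not the Linux* implementation.

```python
import threading
import time


class Rendezvous:
    """Wait point that compares handler entries against the CPU count."""

    def __init__(self, cpu_count):
        self.cpu_count = cpu_count
        self.entered = 0
        self._lock = threading.Lock()

    def arrive_and_wait(self, timeout_s, poll_s=0.001):
        # Register this "CPU" as having entered the handler.
        with self._lock:
            self.entered += 1
        # Poll until every CPU has arrived or the timeout expires, so that
        # a defunct CPU cannot block the handler forever.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if self.entered >= self.cpu_count:
                return True
            time.sleep(poll_s)
        return self.entered >= self.cpu_count


def simulate(cpu_count, responsive, timeout_s=0.2):
    """Run `responsive` handlers out of `cpu_count` expected CPUs."""
    rendezvous = Rendezvous(cpu_count)
    results = []

    def handler():
        results.append(rendezvous.arrive_and_wait(timeout_s))

    threads = [threading.Thread(target=handler) for _ in range(responsive)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `simulate(4, 4)` models the normal case in which all CPUs line up; `simulate(4, 3)` models one defunct CPU, where the remaining handlers give up after the timeout instead of hanging.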


NMI interrupts on Linux* are treated one way for the boot processor (BP) and differently on the application processors 
(AP). Signals from the chipset are expected to be routed only to the BP and only the BP will check chipset registers to 
determine the NMI source. If chipset flag SERR# or IOCHK is set, the BP NMI handler consults configurable control
variables to choose between panicking and ignoring the MC event. Otherwise, and on APs, the NMI handler checks for
software watchdog timers and calls registered NMI handlers; if the NMI is still not serviced, it consults configurable
control variables to choose between panicking and ignoring the unknown NMI.


3.3.3.4 MC Handling in the Intel® Xeon Phi™ Coprocessor Linux*-based coprocessor OS


The Linux* coprocessor OS MCA logic handles capture of core MC events on the Intel® Xeon Phi™ coprocessor without
modifications if events are broadcast to all CPUs the same way as on standard IA systems. A callout is required from
monarch_reign() to a new module for distribution of MC event reports to other interested parties (such as the SMC and
the host side MC agent). After distributing the MC events, the Linux* coprocessor OS uses the grading result to select
between CEs that resume operation immediately and DUEs that must request a reboot to maintenance mode and then
cease operation. Another callout from monarch_reign() is required for this part.


Handling of NMIs in the Linux* coprocessor OS requires new code because uncore MCA registers are completely
different from those of chipset MCA; for example, MMIO register banks vs. I/O registers. Uncore MCA registers are
organized similarly to core MCA banks, but the access method for 32-bit MMIO vs. 64-bit MSRs differs sufficiently to
make a merge into the MCA exception handler code unfeasible. However, the global MC event log, the use of
monarch_reign(), and the event signaling to the host side MCA agent should be the same for the NMI handler as it is for
the MC exception handler.


3.3.3.5 MCA Event Sources and Causes


MCA events are received from three sources on the ring: the CPU box, the GBox, and the SBox. For more information on 
the encoding and controls available on the MCA features, refer to Section 3.3.3.8. 


3.3.3.6 MCA Event Post-Processing (coprocessor OS Side Handling) 


Once the MC event(s) has been collected into the global MC event log and graded, the failure has been classified as 
either a DUE or CE. Both event types are distributed to the host and the SMC, potentially with some form of throttling or 
discrimination based on user configurable settings (via the kernel command line as a boot parameter or at runtime 
through the control panel). 


On CE type failures, the Intel® Xeon Phi™ coprocessor will resume operation because the hardware state is intact. DUE 
failures cannot be ignored and the next action is to signal the host for a reboot into maintenance mode. 


These activities are initiated by callbacks from a special routine and the NMI exception handler. The processing context 
is exception or interrupt. Both of these require careful coding because locking cannot be relied on for synchronization, 
even to a separate handler thread. The stock Linux* reaction to a DUE is simply to panic. On the Intel® Xeon Phi™ 
coprocessor, the recorded events must be distributed to at least two parties, both of which are based on non-trivial APIs 
(the I2C driver for reporting to the SMC and the SCIF driver for reporting to the host-side MC agent).


3.3.3.7 MCA Event Post-Processing (Host Side Handling) 


There are several active proposals on what sort of processing is required for MC events. The Linux* coprocessor OS will 
capture events in raw form and pass them to an agent on the host for further processing. 


The host side MCA agent is a user space application using dedicated SCIF connections to communicate with the Intel® 
Xeon Phi™ coprocessor Linux* coprocessor OS MCA kernel module. The agent is responsible for the following: 


• Maintaining and providing access to a permanent MC event log on the host, preferably as a file on the host’s local
file system. This agent also handles the distribution of events beyond the host.

• Providing a means to reset (or to trigger a reset) of an Intel® Xeon Phi™ coprocessor card into maintenance mode
and passing the latest MC event. The card reset needs support by the host side Intel® Xeon Phi™ coprocessor driver
since ring0 access is required.

• Optionally providing access to the Intel® Xeon Phi™ coprocessor's global MC event log.

• Acting as the host side application; that is, the RAS defect analyzer providing an interface to dump MCA error
records from the EEPROM.


The design of the host side MCA agent is beyond the scope of this document. As much of its functionality as possible
should be placed in a user mode application in order to keep the host side drivers as simple and portable as possible.
Note that sysfs nodes have been requested on Linux* hosts to present Intel® Xeon Phi™ coprocessor card properties,
including MC event logs and states. This may require a kernel agent on the host side to provide the sysfs nodes.


Beyond the overlap of features between driver and user mode agent, this also has issues with SCIF because only one 
party can own a SCIF queue pair. Having separate SCIF links for the kernel driver and user space agent is not feasible. 
The host side MCA agent may split into a kernel driver to provide the sysfs nodes and a user space application using the 
sysfs nodes, where only the kernel driver uses SCIF.


3.3.3.8 Core CPU MCA Registers (Encoding and Controls) 


While the Intel® Xeon Phi™ coprocessor does support MCA and MCE capabilities, the CPUID feature bits used to identify
processor support for these features are not set on the Intel® Xeon Phi™ coprocessor.


The Intel® Xeon Phi™ coprocessor implements a mechanism for detecting and reporting hardware errors. Examples of
these errors include on-chip cache errors, memory CRC errors, and I/O (PCI Express) link errors. The Intel® Xeon Phi™
coprocessor uses sets of MSR registers to set up machine checking as well as to log detected errors.


Machine checks on the Intel® Xeon Phi™ coprocessor are broken down into two domains:
▪ Core machine check events, which are handled in a similar fashion to the IA MCA architecture definition
▪ System machine check events, which are handled in a similar fashion to chipset machine check events


Machine-check event delivery on the Intel® Xeon Phi™ coprocessor is not guaranteed to be synchronous with the
instruction execution that may have caused the event. Therefore, recovery from a machine check is not always possible. 
Software is required to determine if recovery is possible, based on the information stored in the machine-check 
registers. 


The Intel® Xeon Phi™ coprocessor MCA implements one set of MC general registers per CPU (core control registers).
There are three banks of MCx registers per core. All hardware threads running on a core share the same set of registers. 
These registers are for the L1 cache, the L2 cache and the Tag Directories. For the uncore sections, there is one bank of 
registers per box (GBox, SBox, etc.), each of which is composed of eight 32-bit registers. All uncore events are sent over 
a serial link to the SBox’s I/O APIC. From the I/O APIC, an interrupt is sent to a core, after which normal interrupt 
processing occurs. 
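

As an illustration of the three per-core banks, the MSR address of a bank register can be computed from the bank number. The sketch below assumes the conventional IA layout in which bank i occupies four consecutive MSRs starting at 400H + 4*i, which agrees with the core rows of Table 3-9 (for example, 405H for the L2 cache status register). The function name `mc_msr` and the dictionary names are hypothetical, and the bank labels are shorthand for the units named above.

```python
# Assumed layout: MCi_CTL, MCi_STATUS, MCi_ADDR, MCi_MISC at 400H + 4*i + offset.
MC_BANK_BASE = 0x400
REGS_PER_BANK = 4
REG_OFFSETS = {"CTL": 0, "STATUS": 1, "ADDR": 2, "MISC": 3}

# The three banks per core described in the text.
CORE_BANKS = {0: "core (L1 cache)", 1: "L2 cache", 2: "Tag Directory"}


def mc_msr(bank, reg):
    """Return the MSR address of register `reg` in core MC bank `bank`."""
    if bank not in CORE_BANKS:
        raise ValueError(f"unknown core MC bank: {bank}")
    if reg not in REG_OFFSETS:
        raise ValueError(f"unknown MC register: {reg}")
    return MC_BANK_BASE + REGS_PER_BANK * bank + REG_OFFSETS[reg]


if __name__ == "__main__":
    for bank, unit in CORE_BANKS.items():
        print(f"MC{bank}_STATUS ({unit}): {mc_msr(bank, 'STATUS'):X}H")
```

Note that an address computed this way does not imply that the register is implemented; Table 3-9 marks some registers as not implemented.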


The machine check registers on the Intel® Xeon Phi™ coprocessor consist of a set of core control registers, error
reporting MSR register banks, and global system error reporting banks containing error status for the RAS agents. Most 
core machine-check registers are shared amongst all the cores. The machine-check error reporting registers are listed in 
Table 3-5. 

Table 3-5. Control and Error Reporting Registers


Intel® Xeon Phi™ Coprocessor Machine Check Control Registers
Register Name | Size (bits) | Description
MCG_CTL | 64 | Core machine check control register (per thread register)


Intel® Xeon Phi™ Coprocessor Machine Error Reporting Registers
Register Name | Size (bits) | Description
MCi_CTL | 64 | Machine check control register
MCi_STATUS | 64 | Machine check status register
MCi_ADDR | 64 | Machine check address register
MCi_MISC | | Not implemented in every MC bank


3.3.3.8.1 MCi_CTL MSR


The MCi_CTL MSR controls error reporting for specific errors produced by a particular hardware unit (or group of
hardware units). Each of the 64 flags (EEj) represents a potential error. Setting an EEj flag enables reporting of the
associated error, and clearing it disables reporting of the error. Writing the 64-bit value FFFFFFFFFFFFFFFFH to an
MCi_CTL register enables the logging of all errors. The coprocessor does not write changes to bits that are not
implemented.
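

The EEj flag handling can be sketched as plain bit arithmetic on a 64-bit value. The helper names below (`enable_errors`, `disable_errors`, `apply_write`) and the example mask of implemented bits are hypothetical; which EEj bits a given bank implements is not stated here, so `implemented` is passed in as an assumption. The sketch only models the value, it does not access an MSR.

```python
ALL_ERRORS = 0xFFFF_FFFF_FFFF_FFFF  # enables every EEj flag


def _check_bit(j):
    if not 0 <= j <= 63:
        raise ValueError(f"EEj index out of range: {j}")


def enable_errors(ctl, *flags):
    """Return `ctl` with the given EEj flags set."""
    for j in flags:
        _check_bit(j)
        ctl |= 1 << j
    return ctl


def disable_errors(ctl, *flags):
    """Return `ctl` with the given EEj flags cleared."""
    for j in flags:
        _check_bit(j)
        ctl &= ~(1 << j)
    return ctl & ALL_ERRORS


def apply_write(current, written, implemented):
    """Model a write in which unimplemented bits are left unchanged."""
    return (current & ~implemented & ALL_ERRORS) | (written & implemented)
```

For example, with a hypothetical bank that implements only bits 0 through 2, `apply_write(0, ALL_ERRORS, 0b111)` yields `0b111`: writing all ones enables every implemented error and leaves the other bits untouched.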

Table 3-6. MCi_CTL Register Description


Field Name | Bit Range | Description
EEj | 63:0 | Error reporting enable flag (where j is 0 through 63)


3.3.3.8.2 MCi_STATUS MSR


The MCi_STATUS MSR contains information related to a machine check error if its VAL (valid) flag is set. Software is
responsible for clearing the MCi_STATUS register by writing it with all 0's; writing 1's to this register will cause a
general-protection exception to be generated. The fields in this register are as follows (see also Table 3-7):


▪ MCA (machine-check architecture) error code field, bits 0 through 15
Specifies the machine-check architecture defined error code for the machine-check error condition detected.

▪ Model-specific error code field, bits 16 through 31
Specifies the model-specific error code that uniquely identifies the machine-check error condition detected.

▪ Other information field, bits 32 through 56
The functions of the bits in this field are implementation specific and are not part of the machine-check
architecture.

▪ PCC (processor context corrupt) flag, bit 57
Indicates (when set) that the state of the processor might have been corrupted by the detected error condition
and that reliable restarting of the processor may not be possible. When clear, this flag indicates that the error
did not affect the processor's state.

▪ ADDRV (MCi_ADDR register valid) flag, bit 58
Indicates (when set) that the MCi_ADDR register contains the address where the error occurred. When clear,
this flag indicates that the MCi_ADDR register does not contain the address where the error occurred.

▪ MISCV (MCi_MISC register valid) flag, bit 59
Indicates (when set) that the MCi_MISC register contains additional information regarding the error. When
clear, this flag indicates that the MCi_MISC register does not contain additional information regarding the
error.

▪ EN (error enabled) flag, bit 60
Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register.

▪ UC (error uncorrected) flag, bit 61
Indicates (when set) that the processor did not or was not able to correct the error condition. When clear, this
flag indicates that the processor was able to correct the error condition.

▪ OVER (machine check overflow) flag, bit 62
Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the
error-reporting register bank (that is, the VAL bit was already set in the MCi_STATUS register). The processor
sets the OVER flag and software is responsible for clearing it.

▪ VAL (MCi_STATUS register valid) flag, bit 63
Indicates (when set) that the information within the MCi_STATUS register is valid. When this flag is set, the
processor follows the rules given for the OVER flag in the MCi_STATUS register when overwriting previously
valid entries. The processor sets the VAL flag and software is responsible for clearing it.


The VAL bit is only set by hardware when an MC event is detected and the respective MC enable bit in the
MCi_CTL register is set as well. Software should clear the MCi_STATUS.VAL bit by writing all 0's to the
MCi_STATUS register.
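

The field layout listed above maps directly onto shifts and masks. The following sketch decodes a raw 64-bit MCi_STATUS value into its fields using only the bit positions given in this section; the names `MciStatus` and `decode_mci_status` are hypothetical, and the sample value in the usage note is made up for illustration.

```python
from typing import NamedTuple


class MciStatus(NamedTuple):
    mca_code: int    # bits 0-15, MCA error code
    model_code: int  # bits 16-31, model-specific error code
    other: int       # bits 32-56, implementation-specific information
    pcc: bool        # bit 57, processor context corrupt
    addrv: bool      # bit 58, MCi_ADDR register valid
    miscv: bool      # bit 59, MCi_MISC register valid
    en: bool         # bit 60, error enabled
    uc: bool         # bit 61, error uncorrected
    over: bool       # bit 62, machine check overflow
    val: bool        # bit 63, MCi_STATUS register valid


def decode_mci_status(raw):
    """Split a raw 64-bit MCi_STATUS value into its documented fields."""
    def flag(bit):
        return bool((raw >> bit) & 1)

    return MciStatus(
        mca_code=raw & 0xFFFF,
        model_code=(raw >> 16) & 0xFFFF,
        other=(raw >> 32) & 0x1FFFFFF,  # 25 bits wide
        pcc=flag(57),
        addrv=flag(58),
        miscv=flag(59),
        en=flag(60),
        uc=flag(61),
        over=flag(62),
        val=flag(63),
    )
```

A handler would typically check `val` first and ignore the remaining fields when it is clear. For example, `decode_mci_status((1 << 63) | (1 << 61) | (1 << 58) | 0x0135)` reports a valid, uncorrected error with a valid address and MCA error code 0135H.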


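Table 3-7. MCi_STATUS Register Description


Field Name | Bit Range | Description | Type
MCA Code | 15:0 | MCA error code field |
Model Code | 31:16 | Model-specific error code field |
Info | 56:32 | Other information field |
PCC | 57 | Processor context corrupt |
ADDRV | 58 | MCi_ADDR register valid |
MISCV | 59 | MCi_MISC register valid |
EN | 60 | Error enabled |
UC | 61 | Error uncorrected |
OVER | 62 | Machine check overflow | R/W
VAL | 63 | MCi_STATUS register valid | R/W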


3.3.3.8.3 MCi_ADDR MSR 


The MCi_ADDR MSR contains the address of the code or data memory location that produced the machine-check error if
the ADDRV flag in the MCi_STATUS register is set. The address returned is a physical address on the Intel® Xeon Phi™
coprocessor.


Table 3-8. MCi_ADDR Register Description


Field Name | Bit Range | Description | Type
Address | n:0 | Address associated with error event |
Reserved | 63:n | Reserved (where n is implementation specific) | R/W


3.3.3.8.4 MCi_MISC MSR


The MCi_MISC MSR contains additional information describing the machine-check error if the MISCV flag in the
MCi_STATUS register is set. This register is not implemented in the MC0 error-reporting register banks of the Intel®
Xeon Phi™ coprocessor.


3.3.3.8.5 Summary of Machine Check Registers 


Table 3-9 describes the Intel® Xeon Phi™ Coprocessor MCA registers. 


Table 3-9. Machine Check Registers


MSR/MMIO Address | Register Name | Size (bits) | Description

Core MCA Registers
179H | MCG_CAP | 64 | Core machine check capability register
17AH | MCG_STATUS | 64 | Core machine check status register
17BH | MCG_CTL | 64 | Core machine check control register
400H | MC0_CTL | 64 | Core machine check control register
401H | MC0_STATUS | 64 | Core machine check status register
402H | MC0_ADDR | 64 | Core machine check address register (Not Implemented)
403H | MC0_MISC | |

Intel® Xeon Phi™ coprocessor MCA Registers
404H | MC1_CTL | 64 | L2 Cache machine check control register
405H | MC1_STATUS | 64 | L2 Cache machine check status register
406H | MC1_ADDR | 64 | L2 Cache machine check address register
407H | MC1_MISC | |
408H | MC2_CTL | 64 | TAG Directory machine check control register
409H | MC2_STATUS | 64 | TAG Directory machine check status register
40AH | MC2_ADDR | 64 | TAG Directory machine check address register
40BH | MC2_MISC | | TAG Directory (Not Implemented)

Uncore MCA Registers (#18 MCA interrupt not generated. Signalling via local interrupt controller)

SBox MCA Registers
0x8007DAB00 | | | SBox MCA Interrupt Status Register (Not retained on warm reset)
0x8007DAB04 | | |
0x8007C0340 | | |
0x8007C0348 | RTD_MCX_STATUS | 64 | TAG Directory machine check status register
0x8007C0350 | RTD_MCX_ADDR | 64 | TAG Directory machine check address register
0x800620340 | | |
0x800620348 | RTD_MCX_STATUS | 64 | TAG Directory machine check status register
0x800620350 | RTD_MCX_ADDR | 64 | TAG Directory machine check address register

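Memory Controller (Gbox0) MCA Registers
0x8007A005C | MCX_CTL_LO | | Gbox0 Fbox machine check control register
0x8007A0060 | MCX_CTL_HI | | Gbox0 Fbox machine check control register
0x8007A0064 | MCX_STATUS_LO | | Gbox0 Fbox machine check status register
0x8007A0068 | MCX_STATUS_HI | | Gbox0 Fbox machine check status register
0x8007A006C | MCX_ADDR_LO | | Gbox0 Fbox machine check address register
0x8007A0070 | MCX_ADDR_HI | | Gbox0 Fbox machine check address register
0x8007A0074 | MCX_MISC | | Gbox0 Fbox Misc (Transaction timeout register)
0x8007A017C | MCA_CRC0_ADDR | | Gbox0 Mbox0 CRC address capture register
0x8007A097C | MCA_CRC1_ADDR | | Gbox0 Mbox1 CRC address capture register

Memory Controller (Gbox1) MCA Registers
0x80079005C | MCX_CTL_LO | | Gbox1 Fbox machine check control register
0x800790060 | MCX_CTL_HI | | Gbox1 Fbox machine check control register
0x800790064 | MCX_STATUS_LO | | Gbox1 Fbox machine check status register
0x800790068 | MCX_STATUS_HI | | Gbox1 Fbox machine check status register
0x80079006C | MCX_ADDR_LO | | Gbox1 Fbox machine check address register
0x800790070 | MCX_ADDR_HI | | Gbox1 Fbox machine check address register
0x800790074 | MCX_MISC | | Gbox1 Fbox Misc (Transaction timeout register)
0x80079017C | MCA_CRC0_ADDR | | Gbox1 Mbox0 CRC address capture register
0x80079097C | MCA_CRC1_ADDR | | Gbox1 Mbox1 CRC address capture register

Memory Controller (Gbox2) MCA Registers
0x80070005C | MCX_CTL_LO | | Gbox2 Fbox machine check control register
0x800700060 | MCX_CTL_HI | | Gbox2 Fbox machine check control register
0x800700064 | MCX_STATUS_LO | | Gbox2 Fbox machine check status register
0x800700068 | MCX_STATUS_HI | | Gbox2 Fbox machine check status register
0x80070006C | MCX_ADDR_LO | | Gbox2 Fbox machine check address register
0x800700070 | MCX_ADDR_HI | | Gbox2 Fbox machine check address register
0x800700074 | MCX_MISC | | Gbox2 Fbox Misc (Transaction timeout register)
0x80070017C | MCA_CRC0_ADDR | | Gbox2 Mbox0 CRC address capture register
0x80070097C | MCA_CRC1_ADDR | | Gbox2 Mbox1 CRC address capture register

Memory Controller (Gbox3) MCA Registers
0x8006F005C | MCX_CTL_LO | | Gbox3 Fbox machine check control register
0x8006F0060 | MCX_CTL_HI | | Gbox3 Fbox machine check control register
0x8006F0064 | MCX_STATUS_LO | | Gbox3 Fbox machine check status register
0x8006F0068 | MCX_STATUS_HI | | Gbox3 Fbox machine check status register
0x8006F006C | MCX_ADDR_LO | | Gbox3 Fbox machine check address register
0x8006F0070 | MCX_ADDR_HI | | Gbox3 Fbox machine check address register
0x8006F0074 | MCX_MISC | | Gbox3 Fbox Misc (Transaction timeout register)
0x8006F017C | MCA_CRC0_ADDR | | Gbox3 Mbox0 CRC address capture register
0x8006F097C | MCA_CRC1_ADDR | | Gbox3 Mbox1 CRC address capture register

Memory Controller (Gbox4) MCA Registers
0x8006D005C | MCX_CTL_LO | | Gbox4 Fbox machine check control register
0x8006D0060 | MCX_CTL_HI | | Gbox4 Fbox machine check control register
0x8006D0064 | MCX_STATUS_LO | | Gbox4 Fbox machine check status register
0x8006D0068 | MCX_STATUS_HI | | Gbox4 Fbox machine check status register
0x8006D006C | MCX_ADDR_LO | | Gbox4 Fbox machine check address register
0x8006D0070 | MCX_ADDR_HI | | Gbox4 Fbox machine check address register
0x8006D0074 | MCX_MISC | | Gbox4 Fbox Misc (Transaction timeout register)
0x8006D017C | MCA_CRC0_ADDR | | Gbox4 Mbox0 CRC address capture register
0x8006D097C | MCA_CRC1_ADDR | | Gbox4 Mbox1 CRC address capture register

Memory Controller (Gbox5) MCA Registers
0x8006C005C | MCX_CTL_LO | | Gbox5 Fbox machine check control register
0x8006C0060 | MCX_CTL_HI | | Gbox5 Fbox machine check control register
0x8006C0064 | MCX_STATUS_LO | | Gbox5 Fbox machine check status register
0x8006C0068 | MCX_STATUS_HI | | Gbox5 Fbox machine check status register
0x8006C006C | MCX_ADDR_LO | | Gbox5 Fbox machine check address register
0x8006C0070 | MCX_ADDR_HI | | Gbox5 Fbox machine check address register
0x8006C0074 | MCX_MISC | | Gbox5 Fbox Misc (Transaction timeout register)
0x8006C017C | MCA_CRC0_ADDR | | Gbox5 Mbox0 CRC address capture register
0x8006C097C | MCA_CRC1_ADDR | | Gbox5 Mbox1 CRC address capture register

Memory Controller (Gbox6) MCA Registers
0x8006B005C | MCX_CTL_LO | | Gbox6 Fbox machine check control register
0x8006B0060 | MCX_CTL_HI | | Gbox6 Fbox machine check control register
0x8006B0064 | MCX_STATUS_LO | | Gbox6 Fbox machine check status register
0x8006B0068 | MCX_STATUS_HI | | Gbox6 Fbox machine check status register
0x8006B006C | MCX_ADDR_LO | | Gbox6 Fbox machine check address register
0x8006B0070 | MCX_ADDR_HI | | Gbox6 Fbox machine check address register
0x8006B0074 | MCX_MISC | | Gbox6 Fbox Misc (Transaction timeout register)
0x8006B017C | MCA_CRC0_ADDR | | Gbox6 Mbox0 CRC address capture register
0x8006B097C | MCA_CRC1_ADDR | | Gbox6 Mbox1 CRC address capture register

Memory Controller (Gbox7) MCA Registers
0x8006A005C | MCX_CTL_LO | | Gbox7 Fbox machine check control register
0x8006A0060 | MCX_CTL_HI | | Gbox7 Fbox machine check control register
0x8006A0064 | MCX_STATUS_LO | | Gbox7 Fbox machine check status register
0x8006A0068 | MCX_STATUS_HI | | Gbox7 Fbox machine check status register
0x8006A006C | MCX_ADDR_LO | | Gbox7 Fbox machine check address register
0x8006A0070 | MCX_ADDR_HI | | Gbox7 Fbox machine check address register
0x8006A0074 | MCX_MISC | | Gbox7 Fbox Misc (Transaction timeout register)
0x8006A017C | MCA_CRC0_ADDR | | Gbox7 Mbox0 CRC address capture register
0x8006A097C | MCA_CRC1_ADDR | | Gbox7 Mbox1 CRC address capture register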
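

The Gbox rows of Table 3-9 follow a regular pattern: each memory controller has its own base address and the same set of register offsets. The sketch below captures that pattern. It is an illustration only: the dictionary and function names are hypothetical, the Gbox-to-base mapping is inferred from the order of the addresses in the table, and combining a LO/HI pair into one 64-bit value assumes that LO holds bits 31:0 and HI holds bits 63:32.

```python
# Base MMIO address per Gbox, inferred from the address ranges in Table 3-9.
GBOX_BASE = {
    0: 0x8007A0000, 1: 0x800790000, 2: 0x800700000, 3: 0x8006F0000,
    4: 0x8006D0000, 5: 0x8006C0000, 6: 0x8006B0000, 7: 0x8006A0000,
}

# Register offsets within a Gbox, taken from the low bits of the addresses.
GBOX_REG_OFFSET = {
    "MCX_CTL_LO": 0x05C, "MCX_CTL_HI": 0x060,
    "MCX_STATUS_LO": 0x064, "MCX_STATUS_HI": 0x068,
    "MCX_ADDR_LO": 0x06C, "MCX_ADDR_HI": 0x070,
    "MCX_MISC": 0x074,
    "MCA_CRC0_ADDR": 0x17C, "MCA_CRC1_ADDR": 0x97C,
}


def gbox_reg_addr(gbox, reg):
    """Return the MMIO address of register `reg` in memory controller `gbox`."""
    return GBOX_BASE[gbox] + GBOX_REG_OFFSET[reg]


def combine_lo_hi(lo, hi):
    """Join two 32-bit MMIO reads into one 64-bit value (LO = bits 31:0)."""
    return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)
```

A handler that scans all uncore banks could loop over `GBOX_BASE` and read the `MCX_STATUS_LO`/`MCX_STATUS_HI` pair of each Gbox through whatever MMIO accessor the platform provides.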


3.3.3.9 Uncore MCA Registers (Encoding and Controls) 


The Intel® Xeon Phi™ coprocessor's uncore agents (which are not part of the core CPU) signal their machine-check
events via the I/O APIC, and log error events via agent-specific error control and logging registers. These registers are
implemented as registers in Intel® Xeon Phi™ coprocessor MMIO space associated with each uncore agent that is
capable of generating machine-check events.


Once an error is detected by an uncore agent, it signals the interrupt controller located in the uncore system box (SBox). 
The SBox logs the source of the error and generates an interrupt to the specified LRB programmed in the APIC 
redirection tables. 


Software must check all the uncore machine-check banks to identify the source of the uncore machine-check event. To 
enable the generation of a machine-check event from a given source, the software should set the corresponding bit in 
the SBox MCA Interrupt Enable Register (MCA_INT_EN). To disable the generation of machine-check events from a
given source, the software should clear the corresponding bit of the SBox MCA_INT_EN register.


Sources of uncore machine-check events in the Intel® Xeon Phi™ coprocessor uncore are listed in Table 3-10.


Table 3-10. Sources of Uncore Machine-Check Events


Uncore Agent Name | Number of Instances | Description
SBox | | System agent
Tag Directory | | Tag Directories not collocated with CPU slice
GBox | 8 | Memory controller

Each uncore agent capable of generating machine-check events contains event control and logging registers to facilitate 
event detection and delivery. 


3.3.3.9.1 System Agent (SBox) Error Logging Registers 


The SBox contains a set of machine check registers similar to core bank registers, but implemented in MMIO (USR 
register) for reporting errors. Machine check events from the SBox are routed to the OS running on a specified thread 
via the local APIC in the SBox. The SBox local APIC redirection table assigned to MCA interrupts must be programmed 
with a specific thread in order to service SBox (and other uncore) machine-check events. Errors related to DMA requests 
are handled directly by the affected DMA Channel itself and are reported to the DMA SW Driver via the local I/O APIC or 
by the System Interrupt logic, depending on the assigned ownership of the channel. All MCA errors detected by the 
SBox are logged in the SBox MCA logging registers (MCX_STATUS, MCX_MISC, and MCX_ADDR) regardless of whether
the corresponding MCX_CTL bit is set, the exception being when the MCX_STATUS.EN bit is already set. Only errors
with their corresponding bit set in the MCX_CTL register can signal an error.


Table 3-11. SBox Machine Check Registers


Register Name | Size (bits) | Description
MCX_CTL_LO | | Machine check control register
MCX_CTL_HI | 32 | Machine check control register (Reads 0, Writes Dropped, Not implemented on coprocessor)
MCX_ADDR_LO | | Machine check address register
MCX_ADDR_HI | | Machine check address register
MCX_MISC | |
MCX_MISC2 | | Misc (timeout address register)


3.3.3.9.2 Multiple Errors and Errors Over Multiple Cycles


There are two cases in which the SBox may receive two or more errors before the software has a chance to process each
individual error:


1. Multiple errors (errors occurring simultaneously). This occurs when multiple error events are detected in the
same cycle. Essentially, this case means that the hardware does not have to decode and prioritize multiple errors
that occur in the same cycle.

2. Errors occurring one after another over multiple cycles. This occurs when an existing error is already logged in
the MCx register and another error is received in a subsequent cycle.


3.3.3.9.3 SBox Error Events 


Table 3-12 lists the value of the Model code associated with each individual error. The sections following the table 
provide some additional information on a select set of these errors. 


Table 3-12. SBox Error Descriptions


Error Class | MCX_CTL Bit | Error Name | Model Code | Description | SBOX Behaviour
Unsuccessful Completion | | Received Configuration Request Retry Status (CRS) | 0x0006h | A Completion with Configuration Request Retry Status was received for a Request from a Ring Agent | All 1's for data returned to the Ring
 | | Received Completer Abort (CA) | 0x0007h | A Completion with Completer Abort status was received for a Request from a Ring Agent | All 1's for data returned to the Ring
 | | Received Unsupported Request (UR) | 0x0008h | A Completion with Unsupported Request status was received for a Request from a Ring Agent | All 1's for data returned to the Ring
Poisoned Data | 7 | Received Poisoned Data in Completion (PD) | 0x0040h | A Successful Completion (SC) with Poisoned Data was received for a Request from a Ring Agent | Data payload with error is returned to the Ring
 | | Upstream Request terminated by Completion Timeout (CTO) | 0x0009h | A Completion Timeout was detected for a Request from a Ring Agent | All 1's for data returned to the Ring.
Illegal Access | | Downstream Address outside of User accessible Range | 0x0020h | PCIE downstream attempt to use indirect registers to access illegal address ranges via the I/O space | RD: Successful Completion (SC) with all 0's for data returned to PCIe. WR: Discard transaction. Successful Completion (SC) with no data returned to PCIe.
 | | Unclaimed Address (UA) | 0x0021h | A Ring Agent Request to an unclaimed address was terminated by subtractive decode | RD: All 1's for data returned to the Ring. WR: Request is discarded.
 | | PCIe Correctable Error | 0x0030h | A PCIe correctable error was logged by the Endpoint | ERR_COR Message transmitted on PCIe
 | | PCIe Uncorrectable Error | 0x0031h | A PCIe Uncorrectable error was logged by the Endpoint | ERR_NONFATAL or ERR_FATAL Message transmitted on PCIe


3.3.3.9.3.1 Timeout Error 


An upstream timeout occurs when the target fails to respond to an Intel® Xeon Phi™ coprocessor-initiated read request 
within a programmable amount of time. The PCIE endpoint keeps track of these outstanding completions and will signal 
the GHOST unit when it is okay to free up the buffers allocated to hold the timed out completion. To ensure that the 
core subsystem within the Intel® Xeon Phi™ coprocessor doesn’t hang while waiting for a read that will never return, the 
SBox generates a dummy completion back to the requesting thread. The payload of this completion (BAD66BAD) clearly 
indicates that the completion is fake. As well as generating an MCA event that is logged in the MCX_STATUS register, a 
portion of the completion header associated with the failing PCIe transaction is logged in the MCX_MISC register.
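

Software that consumes read data can use the dummy payload as a hint that a timeout occurred. The sketch below is a minimal illustration under two assumptions that are not stated in this section: that the payload BAD66BAD is seen by software as a 32-bit word, and that a logged event in MCX_STATUS is available as a boolean for confirmation. The function names are hypothetical.

```python
FAKE_COMPLETION_PAYLOAD = 0xBAD66BAD  # payload of the SBox dummy completion


def looks_like_fake_completion(value):
    """Return True if a 32-bit read result equals the dummy payload."""
    return (value & 0xFFFFFFFF) == FAKE_COMPLETION_PAYLOAD


def read_timed_out(value, mcx_status_valid):
    """Treat a read as timed out only if the payload matches the dummy
    value and an MCA event was also logged, because real data may
    happen to contain the same bit pattern."""
    return looks_like_fake_completion(value) and mcx_status_valid
```

Checking the payload alone is not conclusive, which is why the sketch pairs it with the logged MCA event.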


3.3.3.9.3.2 Unrecognized Transaction Error 


This type of error indicates that a transaction was dropped by the SBox because it was of a type that is not handled by 
Intel® Xeon Phi™ coprocessor. Transactions that fall into this category are usually vendor-specific messages that are not 
recognized by the Intel® Xeon Phi™ coprocessor. 


3.3.3.9.3.3 Illegal Access Error 


An illegal access error indicates that the SBOX was unable to complete the transaction because the destination address 
was not within the legal range. For inbound transactions initiated by PCIE, this can only happen via I/O read and write 
cycles to the indirect address and data ports. If the user specifies an address above or below the range set aside for 
MMIO host visibility, a machine check exception will be generated and key debug information will be logged for 
software inspection. 


Ring-initiated transactions can also result in an illegal access error if the coprocessor OS or Tag Directory contains flaws
in the coding or logic. The SBox microarchitecture will ensure that all EXT_RD and EXT_WR transactions are delivered to
the endpoint scheduler for inspection. If the destination address (an internal Intel® Xeon Phi™ coprocessor address)
does not match one of the direct-access ranges set aside for the Flash device or does not match one of the 32 available
system memory ranges, it will be terminated and a default value returned to the requester. If the ring traffic was routed
to the SBox in error, this will likely fail all built-in address range checks and will overload the platform as a result. To
guard against this possibility, the endpoint scheduling logic must explicitly match one of its valid address ranges before
driving the PCI Express link. Outbound traffic that fails this check will result in the following explicit actions:


• EXT_WR transactions will be discarded, key debug information will be logged, and an MCA exception will be
generated.

• EXT_RD transactions will complete and return data back to the requestor, key debug information will be logged,
and an MCA exception will be generated.
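

The outbound range check and its two failure actions can be modeled in a few lines. This is a behavioral sketch, not a description of the hardware: the names `Outcome` and `check_outbound`, the half-open `(start, end)` range representation, and the example ranges are hypothetical, and the default read value of all 1's is an assumption borrowed from the Unclaimed Address row of Table 3-12.

```python
from typing import NamedTuple, Optional

DEFAULT_READ_DATA = 0xFFFFFFFF  # assumed default value returned on a failed read


class Outcome(NamedTuple):
    action: str            # "forward", "discard", or "complete"
    data: Optional[int]    # data returned to the requester, if any
    mca_raised: bool       # whether an MCA exception is generated


def check_outbound(kind, addr, valid_ranges, debug_log):
    """Apply the endpoint scheduler's range check to one ring transaction.

    `valid_ranges` is a list of (start, end) pairs with `end` exclusive.
    `debug_log` is a list that receives (kind, addr) for failed checks.
    """
    if kind not in ("EXT_RD", "EXT_WR"):
        raise ValueError(f"unsupported transaction type: {kind}")
    if any(start <= addr < end for start, end in valid_ranges):
        # Address matches a valid range: drive it onto the PCI Express link.
        return Outcome("forward", None, False)
    debug_log.append((kind, addr))  # key debug information is logged
    if kind == "EXT_WR":
        return Outcome("discard", None, True)
    return Outcome("complete", DEFAULT_READ_DATA, True)
```

For example, with `valid_ranges = [(0x1000, 0x2000)]`, a write to 0x3000 is discarded and logged, while a read from 0x3000 completes with the default data; both raise an MCA exception.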


3.3.3.9.3.4 Correctable PCIe Fabric Error


Errors detected in the PCIe fabric will generate an MCA error and be logged in the MCX_STATUS register as an event. It
is the responsibility of the software handler to extract the error event from the PCIe standalone agent status registers as
well as from communication with the PCIe host. The SBox does not log any more information on this error than what is
contained in the MCX status register. These errors are signaled by the assertion of this endpoint interface signal.


Table 3-13. Correctable PCIe Fabric Error Signal


Signal Name | Width | Description 
The end point has sent a correctable error message to the root complex 


3.3.3.9.3.5 Uncorrectable PCIe Fabric Error 


Errors detected in the PCIe fabric (GDA) will generate an MCA error and be logged in the MCX_STATUS register as an 
event. It is the responsibility of the software handler to extract the error event from the PCIe standalone agent status 
registers as well as from communication with the PCIe host. The SBox does not log any more information on this error 
than is contained in the MCX_STATUS register. These errors are signaled by the assertion of this endpoint interface signal. 


Table 3-14. Uncorrectable PCIe Fabric Error Signal 


Signal Name | Width | Description 
func0_rep_uncor_err | Scalar | The end point has sent an uncorrectable error message (fatal or nonfatal) to the root complex 


3.3.3.9.4 GBox Error Events 


Table 3-15. GBox Errors 


Error Category | MCX_CTL Bit | Error | Model Code | Description | GBOX Behaviour 
 |  | Correctable ECC Error Detected Ch 0 | 0x00000000 | Single bit ECC error on channel 0 | Log/Signal Event 
 |  | Correctable ECC Error Detected Ch 1 | 0x00000000 | Single bit ECC error on channel 1 |  
 |  | Uncorrectable ECC Error Detected Ch 0 |  | Double bit ECC error on channel 0 | Log/Signal Event; "Corrupted" data may be returned to consumer 
 |  | Uncorrectable ECC Error Detected Ch 1 | 0x10000000 | Double bit ECC error | Log/Signal Event; "Corrupted" data may be returned to consumer 
 |  | Illegal Access to Reserved ECC Memory Space | 0x20000000 | Access to reserved ECC memory |  
 |  | CAPE Exceeded Threshold Ch 0 | 0x00000001 | Memory CAPE threshold exceeded on Ch0 |  
 | 5 | CAPE Exceeded Threshold Ch 1 | 0x00000002 | Memory CAPE threshold exceeded on Ch0 |  
Training |  | Channel 0 retraining | 0x00000000 | Channel 0 retraining event |  
Training |  | Channel 1 retraining |  | Channel 1 retraining event |  
Training | 29 | Training failure after DM request Ch 0 | 0x02000000 | Training failure after DM request Ch 0 | Log/Signal Event 
Training | 30 | Training failure after DM request Ch 1 | 0x04000000 | Training failure after DM request Ch 1 |  
Proxy MCA |  | Standalone tag directory Proxy MCA event | 0x00000010 | MCA event in Standalone Tag Directory |  
Miscellaneous |  | Transaction Received to an Invalid Channel | 0x00000004 | Memory transaction with invalid channel encountered |  
Miscellaneous |  | Channel 0 Write Queue overflow | 0x00080000 | Channel 0 Write Queue overflow |  
Miscellaneous | 24 | Channel 1 Write Queue overflow | 0x00100000 | Channel 1 Write Queue overflow |  

GBOX Behaviour entries whose rows cannot be recovered from the source layout: Log/Signal Event; Transaction halted; 
Log/Signal Event, "Corrupted" data may be returned to consumer/Transaction halted; Log/Signal Event, unspecified 
behaviour. Empty cells mark values that are not legible in the source. 


3.3.3.9.5 Tag Directory Error Events 


Table 3-16. TD Errors 


Error Category | MCX_CTL Bit | Transaction | Model Code | Description | Logging Register 
Tag-State UNCORR Error |  | External Transaction | 0x0001h | A tag error occurred on an external TD transaction | MC2_STATUS, MC2_ADDR 
Tag-State UNCORR Error |  | Internal Transaction | 0x0002h | A tag error occurred on an internal TD transaction (i.e. Victim) | MC2_STATUS, MC2_ADDR 
Core-Valid UNCORR Error |  | External Transaction | 0x0010h | A state error occurred on an external TD transaction | MC2_STATUS, MC2_ADDR 
Core-Valid UNCORR Error |  | Internal Transaction | 0x0011h | A state error occurred on an internal TD transaction (i.e. Victim) | MC2_STATUS, MC2_ADDR 
Tag-State CORR Error |  | External Transaction | 0x0100h | A tag error occurred on an external TD transaction | MC2_STATUS, MC2_ADDR 
Tag-State CORR Error |  | Internal Transaction | 0x0101h | A tag error occurred on an internal TD transaction (i.e. Victim) | MC2_STATUS, MC2_ADDR 
Core-Valid CORR Error |  | External Transaction |  | A state error occurred on an external TD transaction | MC2_STATUS, MC2_ADDR 
Core-Valid CORR Error | 1 | Internal Transaction | 0x0111h | A state error occurred on an internal TD transaction (i.e. Victim) | MC2_STATUS, MC2_ADDR 


3.3.3.9.6 Spare Tag Directory (TD) Logging Registers 


The Spare Tag Directory contains a set of registers similar to core bank registers but implemented in MMIO (USR 
register) space instead of the MSR space that co-located TDs are assigned to. 


3.3.3.9.7 Memory Controller (GBox) Error Logging Registers 


The GBox contains a set of registers similar to core bank registers but implemented in MMIO (USR register) space 
instead of as MSRs. The GBox signals two classes of events, CRC retry and Training failure. CRC retry is signaled when the 
GBox attempts a predefined number of retries for a transaction (before initiating retraining). Training failure is signaled 
when the GBox training logic fails or when a transaction incurs a CRC failure after retraining was initiated. 


Table 3-17. GBox Error Registers 


Register Name | Size (bits) | Description 
MCX_CTL_LO | 32 | Machine Check control register 
MCX_CTL_HI | 32 | Machine Check control register 
MCX_STATUS_LO | 32 | Machine Check status register 
MCX_STATUS_HI | 32 | Machine Check status register 
MCX_ADDR_LO | 32 | Machine Check address register 
MCX_ADDR_HI | 32 | Machine Check address register 
MCX_MISC | 32 | MISC (Transaction timeout register) 
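
Because each logical register in Table 3-17 is exposed as a pair of 32-bit halves, software typically joins the LO and HI halves before decoding. A minimal sketch in C follows; the struct is only a convenient container, and its field order does not describe the real MMIO layout, which this section does not give.

```c
#include <stdint.h>

/* The seven 32-bit registers of Table 3-17, held in a plain struct.
 * Field order and packing are illustrative assumptions only. */
typedef struct {
    uint32_t ctl_lo, ctl_hi;
    uint32_t status_lo, status_hi;
    uint32_t addr_lo, addr_hi;
    uint32_t misc;
} gbox_mc_regs_t;

/* Combine a LO/HI pair into one 64-bit value. */
static uint64_t join_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t gbox_mc_status(const gbox_mc_regs_t *r)
{
    return join_halves(r->status_lo, r->status_hi);
}

static uint64_t gbox_mc_addr(const gbox_mc_regs_t *r)
{
    return join_halves(r->addr_lo, r->addr_hi);
}
```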


3.3.3.10 Uncore MCA Signaling 


Once a machine-check event has occurred in an uncore agent and has been logged in the error reporting register 
(MCX_STATUS), the MCX_STATUS.VAL bit is sent from each agent to the SBox interrupt controller, which captures this 
bit in the SBox MCA_INT_STAT register. Each bit of the SBox MCA_INT_STAT register represents an MCA/EMON event 
of an uncore agent. When the corresponding bit of MCA_INT_EN is also set, the SBox will generate an interrupt to 
the specified Intel® Xeon Phi™ coprocessor core with the interrupt vector specified in the SBox interrupt controller's 
redirection table. 
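
The gating described above amounts to a bitwise AND of the status and enable registers. The sketch below assumes both registers are 32 bits wide and that bit positions correspond one-to-one to uncore agents; neither the width nor the agent-to-bit mapping is given in this section, and the function names are hypothetical.

```c
#include <stdint.h>

/* An agent contributes an interrupt only if its event bit is set in
 * MCA_INT_STAT and the same bit is set in MCA_INT_EN. */
static uint32_t mca_pending_agents(uint32_t mca_int_stat, uint32_t mca_int_en)
{
    return mca_int_stat & mca_int_en;
}

/* Index of the lowest-numbered pending agent, or -1 if none is pending. */
static int mca_first_pending(uint32_t pending)
{
    for (int bit = 0; bit < 32; bit++) {
        if (pending & (UINT32_C(1) << bit))
            return bit;
    }
    return -1;
}
```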


3.3.4 CacheLine Disable 


A statistically significant number of SRAM cells can develop erratic and sticky bit failures over time. Burn-in can be used 
to reduce these types of errors, but it is no longer sufficient to guarantee a statistically insignificant number of 
these errors, as has been the case in the past. These array errors also manifest more readily as a result of the 
requirement for the product to run at low voltage in order to reduce power consumption. The Intel® Xeon Phi™ 
coprocessor operational voltage will need to find the right balance between power and reliable operation in this regard, 
and it must be assumed that SRAM arrays on the Intel® Xeon Phi™ coprocessor can develop erratic and sticky bit failures. 


As a result of the statistically significant SRAM array error sources outlined above, the Intel® Xeon Phi™ coprocessor 
supports a mechanism known as Cache Line Disable (CLD) that is used to disable cache lines that develop erratic and 
sticky bit failures. Intel® Xeon Phi™ coprocessor hardware detects these array errors and signals a machine check 
exception to a machine check handler, which implements the error handling policy and which can (optionally) use CLD 
to preclude these errors from occurring in the future. Since the cache line in question will no longer be allowed to 
allocate a new line in the specific array that sourced the error, there may be a slight performance loss. Since the errors 
can be sticky, and therefore persistent, the Intel® Xeon Phi™ coprocessor remembers the CLDs between cold boots and 
reapplies the CLDs as part of the reset process before the cache is enabled. This is done through reset packets that are 
generated in the SBox and delivered to all units with CLD capability. 


3.3.5 Core Disable 


Similar to Cache Line Disable (CLD), core disable enables the software (OS) to disable a segment of the Intel® Xeon Phi™ 
coprocessor, in this case a particular Intel® Xeon Phi™ coprocessor core. 


Core disable is achieved by writing a segment of the flash ROM with a core disable mask, and then initiating a cold or 
warm reboot. The selected cores will not be enumerated. 


Core Disable is intended to be used when it is determined that a particular core cannot function correctly due to specific 
error events. When this case is detected, the coprocessor OS sends information to the host RAS agent corresponding to 
the affected core. The RAS agent reboots the card into a special processing mode to disable the core, and then resets 
the Intel® Xeon Phi™ coprocessor card. 


On the next reboot, the core disable flash record will be used to disable the selected cores and prevent them from 
becoming visible to the coprocessor OS for future scheduling. There will be no allocation into the CRI associated with 
the disabled core, but the co-located TD will still function to maintain Intel® Xeon Phi™ coprocessor coherency. 


3.3.6 Machine Check Flows 


This section describes the physical sources of machine check events on the Intel® Xeon Phi™ coprocessor and the 
hardware flows associated with them. It also suggests how software handlers should address these machine check 
events. 


3.3.6.1 Intel® Xeon Phi™ Coprocessor Core 


Sources for machine check events are the L1 instruction and L1 data caches and their associated TLBs as well as the 
microcode ROM. 


The L1 instruction cache is protected by parity bits. There is no loss of machine state when a cache parity error occurs. 
MCAs generated due to parity errors are informational only and are corrected in hardware. The intent is for software to 
log the event. 


Both TLBs are protected by parity and contain architectural state. Errors in the TLBs are uncorrected. It is up to the 
software handler to decide if execution should continue. 


The L1 data cache is parity protected, but it can contain modified cache lines, so recovery is not possible in every 
case. Also, it does not have any mechanism to convey its machine-check events in a synchronous fault. Hence, 
instructions that encounter parity errors will consume the bad data from the cache. Software must decide if execution 
should continue upon receiving a parity error. The reporting registers provided for this cache allow software to 
invalidate or flush the cache index and way that encountered the parity error. 


The Cache Ring Interface (CRI) L2 data cache is protected by ECC. While machine checks are delivered asynchronously 
with respect to the instruction accessing the cache, single bit errors are corrected by hardware in-line with the data 
delivery. The L2 tags are protected by parity. 


If data integrity is desired, software should consider a mode where all Intel® Xeon Phi™ coprocessor uncorrected errors 
are treated as fatal errors. To enable potential recovery from L2 cache errors, the address and way of the transaction 
that encounters an error are logged in the Cache Ring Interface. Software may use the address to terminate applications 
that use the affected memory address range and flush the affected line from cache. L2 cache errors may occur in the 
Tag array or the data array. Errors in the Tag or data array are typically not corrected and result in incorrect data being 
returned to the application. 


In addition to the error reporting resources, the CRI also contains Cache Line Disable (CLD) registers. These registers are 
programmed on the accumulation of multiple errors to the same cache set and way. Once written, the cache will not 
allow allocations into the specified cache set and way. 
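
A handler-side policy for deciding when to program a CLD register might look like the following sketch. The array dimensions, the threshold of three errors, and all names are hypothetical; the text only states that CLD registers are programmed after multiple errors accumulate on the same set and way.

```c
#include <stdbool.h>
#include <stdint.h>

#define CLD_NUM_SETS   8u   /* hypothetical, kept small for illustration */
#define CLD_NUM_WAYS   8u   /* hypothetical */
#define CLD_THRESHOLD  3u   /* hypothetical policy: errors before disabling */

typedef struct {
    uint8_t errors[CLD_NUM_SETS][CLD_NUM_WAYS];    /* error count per set/way */
    bool    disabled[CLD_NUM_SETS][CLD_NUM_WAYS];  /* true once CLD is applied */
} cld_tracker_t;

/* Record one error on (set, way). Returns true only when this error
 * crosses the threshold, i.e. when the caller should program a CLD
 * register for that set and way. */
static bool cld_record_error(cld_tracker_t *t, unsigned set, unsigned way)
{
    if (set >= CLD_NUM_SETS || way >= CLD_NUM_WAYS)
        return false;                   /* out of range: ignore */
    if (t->disabled[set][way])
        return false;                   /* already disabled */
    if (++t->errors[set][way] >= CLD_THRESHOLD) {
        t->disabled[set][way] = true;
        return true;
    }
    return false;
}
```

Counting per set and way, rather than per address, matches the granularity at which the CLD registers block allocation.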


The Intel® Xeon Phi™ coprocessor does not propagate a poison bit with cache-to-cache transfers. Hence the probability 
of a bad line in the L2 propagating without a machine check is significantly higher. On a cache-to-cache transfer for a 
line with bad parity, a machine check is going to be generated on the source L2's core but the data is going to be 
transferred and cached in the requesting L2 as a good line. As part of the MCA logging on a snoop operation, the 
destination of data is logged; this information can be used by the error handler to contain the effect of an L2 error. 


There are two special cases for snoops. The first is a snoop that encounters a Tag/State error that causes a miss in the 
Tag. The second case is a snoop that misses in the tag without a Tag error (or a regular miss). In both cases, the CRI 
should complete the snoop transaction. For snoop types that need a data response, the CRI returns data that may be 
unrelated to the actual requested line. Snoops that incur a miss with a parity error report a TAG_MISS_UNCORR_ERR 
error, but coherency snoops (from TD) that miss generate a SNP_MISS_UNCORR_ERR error. 


The TD Tag-State (TS) and Core-Valid (CV) arrays are protected by ECC. For the Intel® Xeon Phi™ coprocessor, all errors 
detected in either the TS or CV arrays may generate an MCA event and are logged in the MCA logging register. Single bit 
errors detected by the TD are corrected inline and do not change any TD flows for the affected transaction. 


Software must decide if and when to try to recover from a TD error. To remove the error from the TD, software must 
issue WBINVD instructions such that all cores evict all lines from all caches and then evict or flush all possible addresses 
to the same set as the error address to regain coherency in the TDs, as it is not obvious which lines are tracked in a 
particular TD. 


The TD allows one Cache-Line-Disable (CLD) register that software can program to disable allocation to a particular TD 
set and way. 


3.3.6.2 Memory Controller (GBox) 


The GBox detects link CRC failures between the PBox and the GDDR devices. These CRC failures are delivered as 
machine-check events to the SBox and are logged in the error reporting registers located in the GBox. In addition to 
CRC, ECC protection of GDDR contents has been added. The GBox can detect single and double bit errors and can 
correct single bit errors. Both single and double bit errors can be enabled to signal machine-check events. 


For a read request that encounters a CRC training failure or a double bit ECC error, the GBox will signal the 
corresponding CRC training failure or double bit ECC error event and will generate a fake completion of the request. On a 
write, the GBox should complete the transaction by dropping the write for failing link training or completing the write 
for a double bit error. 


3.3.7 Machine Check Handler 


A software machine check handler (implemented as a kernel module in the coprocessor OS) is required to resolve 
hardware machine check events triggered during Intel® Xeon Phi™ coprocessor operation. The machine check handler is 
responsible for logging hardware-corrected events (threshold controlled) and for communicating information to the 
host RAS agent about uncorrected events and logging these events. The host RAS agent determines the correct action 
to take on uncorrected events. 


3.3.7.1 Basic Flow 


Due to the reliability requirements on the Intel® Xeon Phi™ coprocessor and the unique nature of the hardware, a 
generic handler will not suffice and an Intel® Xeon Phi™ coprocessor specific handler is required. A machine-check 
handler must perform the following steps: 


1. Stop all threads and divert them to the machine check handler via IPI. 
2. Once all threads have reached the machine check handler, skip to step 4. 
3. Otherwise, one or more threads may be hung. Trigger shutdown/checkpoint failure and jump to step 20. 
4. Read MCA registers in each bank and log the information. 
5. If the error is uncorrected (MCi.STATUS.UC || MCi.STATUS.PCC), then jump to step 9. 
6. Write the CLD field in flash, if necessary. 
7. If the reliability threshold is met, then jump to step 9. 
8. Exit handler. 
9. Turn off all caches and disable MCA (MCi.CTL) for all MC banks. 
10. Perform a WBINVD to flush L2 contents. 
11. Invalidate L1 instruction and data caches via test registers. 
12. Turn on the caches, but not MCA. 
13. On a selected thread/core, perform a read of the entire GDDR memory space. 
14. Perform a WBINVD to flush the contents of the selected core. 
15. Clear MCi.STATUS registers for all MC banks. 
16. If reliability testing is not enabled, jump to step 20. 
17. Perform a targeted test of the caches. 
18. Check the contents of the MCi.STATUS register for failure (note that MCi.STATUS.VAL will not be set). 
19. If failure is detected, then set CLD to disable affected lines, and then repeat steps 9-15. 
20. Turn on MCA (enable MCi.CTL). 
21. Assess the severity of the error and determine the action to be taken (i.e., shutdown application, if possible). 
22. Clear the MCIP bit. 
23. Exit handler. 
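
The branch taken in steps 5 through 8 can be sketched as a small triage function. The bit positions below follow the IA-32 MCi_STATUS layout (VAL = bit 63, UC = bit 61, PCC = bit 57) and are assumed here to apply to the coprocessor banks as well; the corrected-error counter, the threshold parameter, and all names are hypothetical.

```c
#include <stdint.h>

/* IA-32 MCi_STATUS bit positions, assumed to apply to these banks. */
#define MCI_STATUS_VAL (UINT64_C(1) << 63)  /* register contents are valid */
#define MCI_STATUS_UC  (UINT64_C(1) << 61)  /* uncorrected error */
#define MCI_STATUS_PCC (UINT64_C(1) << 57)  /* processor context corrupt */

typedef enum {
    MC_NO_EVENT,        /* nothing valid logged in this bank */
    MC_EXIT_HANDLER,    /* corrected and below threshold: step 8 */
    MC_FLUSH_AND_TEST   /* uncorrected or threshold met: continue at step 9 */
} mc_action_t;

/* Decide how the handler proceeds for one bank after step 4. */
static mc_action_t mc_triage(uint64_t status, unsigned corrected_count,
                             unsigned threshold)
{
    if (!(status & MCI_STATUS_VAL))
        return MC_NO_EVENT;
    if (status & (MCI_STATUS_UC | MCI_STATUS_PCC))
        return MC_FLUSH_AND_TEST;           /* step 5 */
    if (corrected_count >= threshold)
        return MC_FLUSH_AND_TEST;           /* step 7 */
    return MC_EXIT_HANDLER;                 /* step 8 */
}
```

This covers only the initial classification; the targeted cache test in steps 17 through 19 reads MCi.STATUS without relying on the VAL bit, as the list notes.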


3.3.8 Error Injection 


There are three basic methods that system software can use to simulate machine-check events: 

1. Via the LDAT DFX register interface. 
2. Via dedicated error injection registers. 
3. Via the machine check STATUS register. 


3.3.8.1 Error Injection Using LDAT 


Errors can be injected directly into protected arrays via the LDAT DFX interface registers. 


To trigger an error, modify the contents of the targeted array, corrupting the ECC/parity/data such that a check 
performed by hardware fails. This is the preferred method to fully test the array protection mechanism and the MC 
logic. The following arrays can be tested by this method: 


• L1 instruction and L1 data caches 
• Both TLBs 
• L2 data array 
• L2 tag array 
• TD state and CV arrays 


3.3.8.2 Dedicated Error Injection Registers 


Machine check events can be generated using dedicated error injection registers available for a limited number of 
protected arrays. For the Intel® Xeon Phi™ coprocessor, this is limited to the MC0 and MC1 error reporting banks. 


3.3.8.3 Error Injection via MCi_STATUS Register 


The last method of injecting MC events into the machine is via the MCi_STATUS register. For the MC1, MC2 and Uncore 
MC Bank registers, writing the MCx_STATUS.VAL bit will cause a machine check event to be generated from the targeted 
error reporting bank. 


3.3.8.4 List of APIs for RAS 


The following interfaces provide communication between RAS features and other parts of the Intel® Xeon Phi™ 
coprocessor software: 


• SCIF access from exception/interrupt context 
• SCIF well-known ports for the MCA agent and Linux* coprocessor OS MC event handlers 
• SCIF message formats for MC events reported to host side agent 
• Reboot to maintenance mode via IOCTL request 
• SCIF message formats for Intel® Xeon Phi™ coprocessor system queries and controls 
• Throttle mechanism for CEs 
• I2C driver for the bus where SMC resides 
• I2C identifiers for communicating with the SMC 
• Data formats for MC events 
• Data formats for Intel® Xeon Phi™ coprocessor system queries (if any) 
• Data formats for system environment changes (fan speeds, temp, etc.) 
• Filter for which events to report to SMC 
• Storage location in SMC for MC raw data 
• Fuse override requests to maintenance mode 
• Diagnostic mode entry to maintenance mode 
• Data formats on the RAS log in Intel® Xeon Phi™ coprocessor EEPROM 


• Time reference in maintenance mode (Intel® Xeon Phi™ coprocessor cards have no time reference). If the RAS log 
includes the timestamp, the host must provide a time base or a reference to a time counter. 


4 Operating System Support and Driver Writer's Guide 


This section discusses the support features that the Intel® Xeon Phi™ coprocessor provides for the operating system and 
device drivers. 


4.1 Third Party OS Support 


The Intel® MIC Architecture products support 3rd party operating systems such as modified versions of Linux* or 
completely custom designs. The Linux*-based coprocessor OS is treated like a 3rd party OS. 


4.2 Intel® Xeon Phi™ Coprocessor Limitations for Shrink-Wrapped Operating Systems 


This section is intended to help developers port an existing operating system that runs on a platform built around an 
Intel 64 processor to Intel® Xeon Phi™ coprocessor hardware. 


4.2.1 Intel x86 and Intel 64 ABI 


The Intel x86 and Intel 64 ABIs use the SSE2 XMM registers, which do not exist in the Intel® Xeon Phi™ coprocessor. 


4.2.2 PC-AT / I/O Devices 


Because the Intel® Xeon Phi™ coprocessor does not have a PCH southbridge, many of the devices generally assumed to 
exist on a PC platform do not exist. Intel® Xeon Phi™ coprocessor hardware supports a serial console using the serial 
port device on the SBOX I2C bus. It is also possible to export a standard device, like an Ethernet interface, to the OS by 
emulating it over system and GDDR memory shared with the host. This allows for higher level functionality, such as SSH 
or Telnet consoles for interactive and NFS for file access. 


4.2.3 Long Mode Support 


Intel 64 Processors that support Long mode also support a compatibility submode within Long mode to handle existing 
32-bit x86 applications without recompilation. The Intel® Xeon Phi™ coprocessor does not support the compatibility 
submode. 


4.2.4 Custom Local APIC 


The local APIC registers have expanded fields for the APIC ID, Logical APIC ID, and APIC Destination ID. Refer to the SDM 
Volume 3A System Programming Guide for details. 


There is a local APIC (LAPIC) per hardware thread in the Intel® Xeon Phi™ coprocessor. In addition, the SBox contains 
within it a LAPIC that has 8 Interrupt Command Registers (ICRs) to support host-to-coprocessor and inter-coprocessor 
interrupts. To initiate an interrupt from the host to an Intel® Xeon Phi™ coprocessor or from one Intel® Xeon Phi™ 
coprocessor to another, the initiator must write to an ICR on the target Intel® Xeon Phi™ coprocessor. Since there are 8 
ICRs, the system can have up to 8 Intel® Xeon Phi™ coprocessors that can be organized in a mesh topology along with 
the host. 


4.2.5 Custom I/O APIC 


The Intel® Xeon Phi™ coprocessor I/O APIC has a fixed 64-bit base address. The base address of the I/O APIC on IA 
platforms is communicated to the OS by the BIOS (Bootstrap) via MP, ACPI, or SFI table entries. The MP and ACPI table 
entries use a 32-bit address for the base address of the I/O APIC, whereas the SFI table entry uses a 64-bit address. 
Operating systems that assume a 32-bit address for the I/O APIC will need to be modified. 


The I/O APIC pins (commonly known as irq0, irq1 and so on) on a PC-compatible platform are connected to ISA and PCI 
device interrupts. None of these interrupt sources exist on the Intel® Xeon Phi™ coprocessor; instead the I/O APIC IRQs 
are connected to interrupts generated by the Intel® Xeon Phi™ coprocessor SBox (e.g., DMA channel interrupts, thermal 
interrupts, etc.). 


4.2.6 Timer Hardware 


Timer hardware devices like the programmable interval timer (PIT), the CMOS real time clock (RTC), the advanced 
configuration and power interface (ACPI) timer, and the high-precision event timer (HPET) commonly found on PC 
platforms are absent on the Intel® Xeon Phi™ coprocessor. 


The lack of timer hardware means that the Intel® Xeon Phi™ coprocessor OS must use the LAPIC timer for all 
timekeeping and scheduling activities. It still needs a mechanism to calibrate the LAPIC timer, which is otherwise 
calibrated using the PIT. It also needs an alternative solution to the continuously running time-of-day (TOD) clock, which 
keeps time in year/month/day hour:minute:second format. The Intel® Xeon Phi™ coprocessor has an SBox MMIO register 
that provides the current CPU frequency, which can be used to calibrate the LAPIC timer. The TOD clock has to be 
emulated in software by querying the host OS for the time at bootup and then using the LAPIC timer interrupt to update it. 
Periodic synchronization with the host may be needed to compensate for timer drift. 
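
The two software mechanisms described above, LAPIC timer calibration from a known clock frequency and an emulated TOD clock, can be sketched as follows. This is an illustration under stated assumptions: the timer is assumed to count at the reported frequency divided by a configured divider, time is kept as whole seconds since an epoch supplied by the host, and all names are hypothetical.

```c
#include <stdint.h>

/* Initial count that makes the timer fire interrupts_per_sec times per
 * second, given the timer input clock and the divider (assumed model). */
static uint64_t lapic_initial_count(uint64_t timer_clock_hz, uint32_t divider,
                                    uint32_t interrupts_per_sec)
{
    if (divider == 0 || interrupts_per_sec == 0)
        return 0;
    return timer_clock_hz / ((uint64_t)divider * interrupts_per_sec);
}

/* Software time-of-day clock driven by LAPIC timer interrupts. */
typedef struct {
    uint64_t base_sec;       /* host time (seconds) at the last sync */
    uint64_t ticks;          /* timer interrupts since the last sync */
    uint32_t ticks_per_sec;  /* timer interrupt rate */
} tod_clock_t;

static void tod_sync(tod_clock_t *c, uint64_t host_time_sec, uint32_t ticks_per_sec)
{
    c->base_sec = host_time_sec;     /* used at boot and for periodic resync */
    c->ticks = 0;
    c->ticks_per_sec = ticks_per_sec ? ticks_per_sec : 1;
}

/* Called from the LAPIC timer interrupt handler. */
static void tod_tick(tod_clock_t *c)
{
    c->ticks++;
}

static uint64_t tod_now(const tod_clock_t *c)
{
    return c->base_sec + c->ticks / c->ticks_per_sec;
}
```

Calling `tod_sync` again with a fresh host time discards accumulated ticks, which is one simple way to absorb timer drift.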


4.2.7 Debug Store 


The Intel® Xeon Phi™ coprocessor does not support the ability to write debug information to a memory resident buffer. 
This feature is used by Branch Trace Store (BTS) and Precise Event Based Sampling (PEBS) facilities. 


4.2.8 Power and Thermal Management 


4.2.8.1 Thermal Monitoring 


Thermal Monitoring of the Intel® Xeon Phi™ coprocessor die is implemented by a Thermal Monitoring Unit (TMU). The 
TMU enforces throttling during thermal events by reducing core frequency ratio. Unlike TM2 thermal monitoring on 
other Intel processors (where thermal events result in throttling of both core frequency and voltage), the Intel® Xeon 
Phi™ coprocessor TMU does not automatically adjust the voltage. The Intel® Xeon Phi™ coprocessor TMU coordinates 
with a software-based mechanism to adjust processor performance states (P-states). The TMU software interface 
consists of a thermal interrupt routed through the SBox I/O APIC and SBox interrupt control and status MMIO registers. 
For more information on the TMU and its software interface refer to the section on Intel® Xeon Phi™ Coprocessor Power 
and Thermal Management. 


4.2.8.2 ACPI Thermal Monitor and Software Controlled Clock Facilities 


The processor implements internal MSRs (IA32_THERM_STATUS, IA32_THERM_INTERRUPT, 
IA32_CLOCK_MODULATION) that allow the processor temperature to be monitored and the processor performance to 
be modulated in predefined duty cycles under software control. 


The Intel® Xeon Phi™ coprocessor supports non-ACPI based thermal monitoring through a dedicated TMU and a set of 
thermal sensors. Thermal throttling of the core clock occurs automatically in hardware during a thermal event. 
Additionally, OS power-management software is given an opportunity to modulate the core frequency and voltage in 
response to the thermal event. These core frequency and voltage settings take effect when the thermal event ends. In 
other words, Intel® Xeon Phi™ coprocessor hardware provides equivalent support for handling thermal events but 
through different mechanisms. 


4.2.8.2.1 Enhanced SpeedStep (EST) 


ACPI defines performance states (P-state) that are used to facilitate system software’s ability to manage processor 
power consumption. EST allows the software to dynamically change the clock speed of the processor (to different P- 
states). The software makes P-state decisions based on P-state hardware coordination feedback provided by EST. 


Again, the Intel® Xeon Phi™ coprocessor is not ACPI compliant. However, the hardware provides a means for the OS 
power-management software to set core frequency and voltage that corresponds to the setting of P-states in the ACPI 
domain. OS PM software in the Intel® Xeon Phi™ coprocessor (just as in the case of EST) dynamically changes the core 
frequency and voltage of the processor cores based on core utilization, thereby reducing power consumption. 
Additionally, the Intel® Xeon Phi™ coprocessor hardware provides feedback to the software when the changes in 
frequency and voltage take effect. This is roughly equivalent to what exists for EST; except that there is a greater burden 
on OS PM software to: 


• generate a table of frequency/voltage pairs that correspond to P-states 
• set core frequency and voltage to dynamically change P-states. 
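
The two responsibilities listed above can be illustrated with a small sketch that builds a P-state table and picks an entry from core utilization. The evenly spaced frequency/voltage interpolation and the linear utilization mapping are assumptions made for illustration; real tables would come from characterized operating points, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t freq_mhz;    /* core frequency for this P-state */
    uint32_t voltage_mv;  /* core voltage for this P-state */
} pstate_t;

/* Fill table[0..n-1] with evenly spaced pairs; index 0 (P0) is fastest. */
static void pstate_build_table(pstate_t *table, size_t n,
                               uint32_t fmax, uint32_t fmin,
                               uint32_t vmax, uint32_t vmin)
{
    uint32_t den = (n > 1) ? (uint32_t)(n - 1) : 1;
    for (size_t i = 0; i < n; i++) {
        uint32_t num = (uint32_t)i;
        table[i].freq_mhz   = fmax - (fmax - fmin) * num / den;
        table[i].voltage_mv = vmax - (vmax - vmin) * num / den;
    }
}

/* Map utilization (0..100 percent) to a table index: a busier core
 * gets a lower index, i.e. a faster P-state. */
static size_t pstate_select(size_t n, unsigned util_pct)
{
    if (n == 0)
        return 0;
    if (util_pct > 100)
        util_pct = 100;
    return (size_t)(100u - util_pct) * (n - 1) / 100u;
}
```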


4.2.9 Pending Break Enable 


The Intel® Xeon Phi™ coprocessor does not support this feature. 


4.2.10 Global Page Tables 


The Intel® Xeon Phi™ coprocessor does not support the global bit in Page Directory Entries (PDEs) and Page Table Entries 
(PTEs). Operating systems typically detect the presence of this feature using the CPUID instruction. This feature is 
enabled on processors that support it by writing to the PGE bit in CR4. On the Intel® Xeon Phi™ coprocessor, writing to 
this bit results in a GP fault. 


4.2.11 CNXT-ID - L1 Context ID 


The Intel® Xeon Phi™ coprocessor does not support this feature. 


4.2.12 Prefetch Instructions 


The Intel® Xeon Phi™ coprocessor's prefetch instructions differ from those available on other Intel® processors that 
support the MMX™ instructions or the Intel® Streaming SIMD Extensions. As a result, the PREFETCH instruction is not 
supported. This set of instructions is replaced by VPREFETCH as described in the Intel® Xeon Phi™ Coprocessor 
Instruction Set Reference Manual (Reference Number: 327364). 


4.2.13 PSE-36 


PSE-36 refers to an Intel processor feature (in 32-bit mode) that extends the physical memory addressing capabilities 
from 32 bits to 36 bits. The Intel® Xeon Phi™ coprocessor has 40 bits of physical address space but only supports 32 bits 
of physical address space in 32-bit mode. See also the Intel® Xeon Phi™ Coprocessor Instruction Set Reference Manual 
(Reference Number: 327364). 


4.2.14 PSN (Processor Serial Number) 


The Intel® Xeon Phi™ coprocessor does not support this feature. 


4.2.15 Machine Check Architecture 


The Intel® Xeon Phi™ coprocessor does not support MCA as defined by the Intel® Pentium® Pro and later Intel 
processors. However, MCEs on the Intel® Xeon Phi™ coprocessor are compatible with the Intel® Pentium® processor. 


4.2.16 Virtual Memory Extensions (VMX) 


The Intel® Xeon Phi™ coprocessor does not support the virtualization technology (VT) extensions available on some 
Intel® 64 processors. 


4.2.17 CPUID 


The Intel® Xeon Phi™ coprocessor supports a highest-source operand value (also known as a CPUID leaf) of 4 for CPUID 
basic information, 0x80000008 for extended function information, and 0x20000001 for graphics function information. 


4.2.17.1 Always Running LAPIC Timer 


The LAPIC timer on the Intel® Xeon Phi™ coprocessor keeps ticking even when the Intel® Xeon Phi™ coprocessor core is 
in the C3 state. On other Intel processors, the OS detects the presence of this feature using CPUID leaf 6. The Intel® 
Xeon Phi™ coprocessor does not support this leaf, so any existing OS code that detects this feature must be modified to 
support the Intel® Xeon Phi™ coprocessor. 
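
One way to adapt existing detection code is to check the highest supported basic leaf before consulting leaf 6, and to treat the coprocessor as a known special case. The sketch below operates on already-read CPUID values; the ARAT flag is bit 2 of EAX in CPUID leaf 6 on other Intel processors, while the `is_xeon_phi_coprocessor` flag and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_LEAF_THERMAL_POWER 6u
#define CPUID_06_EAX_ARAT (UINT32_C(1) << 2)  /* Always Running APIC Timer */

/* Decide whether the LAPIC timer keeps running in deep C-states.
 *   max_basic_leaf: EAX returned by CPUID leaf 0
 *   leaf6_eax:      EAX returned by CPUID leaf 6 (ignored if unsupported)
 */
static bool lapic_timer_always_running(uint32_t max_basic_leaf,
                                       uint32_t leaf6_eax,
                                       bool is_xeon_phi_coprocessor)
{
    if (is_xeon_phi_coprocessor)
        return true;    /* timer keeps ticking in C3, but leaf 6 is absent */
    if (max_basic_leaf < CPUID_LEAF_THERMAL_POWER)
        return false;   /* leaf 6 not available: feature cannot be reported */
    return (leaf6_eax & CPUID_06_EAX_ARAT) != 0;
}
```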


4.2.18 Unsupported Instructions 


For details on supported and unsupported instructions, please consult the Intel® Xeon Phi™ Coprocessor Instruction 
Set Reference Manual (Reference Number: 327364). 


4.2.18.1 Memory Ordering Instructions 


The Intel® Xeon Phi™ coprocessor memory model is the same as that of the Intel® Pentium® processor. The reads and 
writes always appear in programmed order at the system bus (or the ring interconnect in the case of the Intel® Xeon 
Phi™ coprocessor); the exception being that read misses are permitted to go ahead of buffered writes on the system bus 
when all the buffered writes are cache hits and are, therefore, not directed to the same address being accessed by the 
read miss. 


As a consequence of its stricter memory ordering model, the Intel® Xeon Phi™ coprocessor does not support the 
SFENCE, LFENCE, and MFENCE instructions that provide a more efficient way of controlling memory ordering on other 
Intel processors. 


While reads and writes from an Intel® Xeon Phi™ coprocessor appear in program order on the system bus, the compiler 
can still reorder unrelated memory operations while maintaining program order on a single Intel® Xeon Phi™ 
coprocessor (hardware thread). If software running on an Intel® Xeon Phi™ coprocessor is dependent on the order of 
memory operations on another Intel® Xeon Phi™ coprocessor, then a serializing instruction (e.g., CPUID, instruction with 
a LOCK prefix) between the memory operations is required to guarantee completion of all memory accesses issued prior 
to the serializing instruction before any subsequent memory operations are started. 
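
As an illustration of the last point, the sketch below builds a barrier from a LOCK-prefixed instruction instead of MFENCE and uses it in a simple publish/consume pair. It relies on GCC-style inline assembly, which is an assumption about the toolchain; the `"memory"` clobber also stops the compiler from reordering accesses across the barrier. The fallback branch exists only so the sketch builds on other architectures, and all names are hypothetical.

```c
#include <stdint.h>

/* Full barrier built from a LOCK-prefixed read-modify-write, avoiding the
 * unsupported MFENCE/SFENCE/LFENCE instructions. */
static inline void full_barrier(void)
{
#if defined(__x86_64__)
    /* Adds zero to the top of the stack: no data changes, but the LOCK
     * prefix orders earlier accesses before later ones. */
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#else
    __sync_synchronize();   /* generic fallback for non-x86-64 builds */
#endif
}

static volatile uint32_t payload;  /* data being handed over */
static volatile uint32_t ready;    /* flag: payload is valid */

/* Producer: write the data, then the flag, in that order. */
static void publish(uint32_t value)
{
    payload = value;
    full_barrier();
    ready = 1;
}

/* Consumer: read the flag, then the data. Returns 1 on success. */
static int try_consume(uint32_t *out)
{
    if (!ready)
        return 0;
    full_barrier();
    *out = payload;
    return 1;
}
```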
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4.2.18.2 Conditional Movs 


The Intel® Xeon Phi™ coprocessor does not support the conditional move (CMOVcc) instructions. The OS can detect the 
lack of CMOV support using CPUID. 


4.2.18.3 IN and OUT 


The Intel® Xeon Phi™ coprocessor does not support the IN (IN, INS, INSB, INSW, INSD) and OUT (OUT, OUTS, OUTSB, 
OUTSW, OUTSD) instructions. Executing any of these instructions results in a general-protection (GP) fault. There is no 
use for these instructions on Intel® Xeon Phi™ coprocessors; all I/O devices are accessed through MMIO registers. 


4.2.18.4 SYSENTER and SYSEXIT 


The Intel® Xeon Phi™ coprocessor does not support the SYSENTER and SYSEXIT instructions that are used by 32-bit Intel 
processors (since the Pentium® II) to implement system calls. However, the Intel® Xeon Phi™ coprocessor does support 
the SYSCALL and SYSRET instructions that are supported by Intel® 64 processors. Using CPUID, the OS can detect the lack 
of SYSENTER and SYSEXIT and the presence of the SYSCALL and SYSRET instructions. 
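

The CPUID-based checks described for CMOV, SYSENTER/SYSEXIT, and SYSCALL/SYSRET can be sketched as follows. This is 
only an illustration, not code from the MPSS: it decodes EDX values that are assumed to have already been read from 
CPUID leaf 01H and leaf 80000001H, using the standard IA-32/Intel® 64 feature-bit positions. The function names and the 
sample register values are hypothetical. 

```python
# Standard CPUID feature-flag bit positions (IA-32 / Intel 64 definitions).
CPUID_1_EDX_SEP = 1 << 11          # leaf 01H, EDX bit 11: SYSENTER/SYSEXIT
CPUID_1_EDX_CMOV = 1 << 15         # leaf 01H, EDX bit 15: CMOVcc
CPUID_EXT1_EDX_SYSCALL = 1 << 11   # leaf 80000001H, EDX bit 11: SYSCALL/SYSRET


def decode_features(leaf1_edx, ext_leaf1_edx):
    """Decode the feature bits an OS would test before using these instructions."""
    return {
        "cmov": bool(leaf1_edx & CPUID_1_EDX_CMOV),
        "sysenter_sysexit": bool(leaf1_edx & CPUID_1_EDX_SEP),
        "syscall_sysret": bool(ext_leaf1_edx & CPUID_EXT1_EDX_SYSCALL),
    }


def pick_syscall_mechanism(features):
    """Choose a system-call entry mechanism from the decoded feature bits."""
    if features["syscall_sysret"]:
        return "SYSCALL/SYSRET"
    if features["sysenter_sysexit"]:
        return "SYSENTER/SYSEXIT"
    return "software interrupt"


# Hypothetical register values modelling the behaviour described above:
# CMOV and SEP clear in leaf 01H, SYSCALL set in leaf 80000001H.
features = decode_features(leaf1_edx=0x00000001, ext_leaf1_edx=1 << 11)
print(features, pick_syscall_mechanism(features))
```

An OS written this way never needs a hard-coded processor check; it simply follows whatever the feature bits report. 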


4.2.18.5 MMX™ Technology and Streaming SIMD Extensions 


The Intel® Xeon Phi™ coprocessor only supports SIMD vector registers that are 512 bits wide (zmm0-31) along with eight 
16-bit wide vector mask registers. 


The IA-32 architecture includes features by which an OS can avoid the time-consuming restoring of the floating-point 
state when activating a user process that does not use the floating-point unit. It does this by setting the TS bit in control 
register CR0. If a user process then tries to use the floating-point unit, a device-not-available fault (exception 7 = #NM) 
occurs. The OS can respond to this by restoring the floating-point state and by clearing CR0.TS, which prevents the fault 
from recurring. 


The Intel® Xeon Phi™ coprocessor does not include any explicit instruction to perform a context save and restore of the 
Intel® Xeon Phi™ coprocessor state. To perform a context save and restore you can use: 


• Vector loads and stores for vector registers 
• A combination of vkmov plus scalar loads and stores for mask registers 
• LDMXCSR and STMXCSR for the MXCSR state register 


4.2.18.6 Monitor and Mwait 


The Intel® Xeon Phi™ coprocessor does not support the MONITOR and MWAIT instructions. The OS can use CPUID to 
detect lack of support for these. 


MONITOR and MWAIT are provided to improve synchronization between multiple agents. In the implementation for the 
Intel® Pentium® 4 processor with Streaming SIMD Extensions 3 (SSE3), MONITOR/MWAIT are targeted for use by system 
software to provide more efficient thread synchronization primitives. MONITOR defines an address range used to 
monitor write-back stores. MWAIT is used to indicate that the software thread is waiting for a write-back store to the 
address range defined by the MONITOR instruction. 


FCOMI, FCOMIP, FUCOMI, FUCOMIP, FCMOVcc 
The Intel® Xeon Phi™ coprocessor does not support these floating-point instructions, which were introduced after the 
Intel® Pentium® processor. 


4.2.18.7 Pause 


The Intel® Xeon Phi™ coprocessor does not support the PAUSE instruction (introduced in the Intel® Pentium® 4 to 
improve its performance in spin loops and to reduce the power consumed). The Intel® Pentium® 4 and the Intel® Xeon® 
processors implement the PAUSE instruction as a pre-defined delay. The delay is finite and can be zero for some 
processors. The equivalent Intel® Xeon Phi™ coprocessor instruction is DELAY, which has a programmable delay. Refer 
to the programmer's manual for further details. 


5 Application Programming Interfaces 


5.1 The SCIF APIs 


SCIF provides a mechanism for internode communication within a single platform, where a node is either an Intel® Xeon 
Phi™ coprocessor or the Xeon-based host processor complex. In particular, SCIF abstracts the details of communicating 
over the PCI Express* bus (and controlling related Intel® Xeon Phi™ coprocessors) while providing an API that is 
symmetric between the host and the Intel® Xeon Phi™ coprocessor. An important design objective for SCIF was to 
deliver the maximum possible performance given the communication abilities of the hardware. 


The Intel® MPSS supports a computing model in which the workload is distributed across both the Intel® Xeon® host 
processors and the Intel® Xeon Phi™ coprocessor-based add-in PCI Express* cards. An important property of SCIF is 
symmetry; SCIF drivers should present the same interface on both the host processor and the Intel® Xeon Phi™ 
coprocessor so that software written to SCIF can be executed wherever is most appropriate. The SCIF architecture is 
operating system independent; that is, SCIF implementations on different operating systems are able to 
intercommunicate. SCIF is also the transport layer that supports MPI, OpenCL*, and networking (TCP/IP). 


This section defines the architecture of the Intel® MIC Symmetric Communications Interface (SCIF). It identifies all 
external interfaces and each internal interface between the major system components. 


The feature sets listed below are interdependent with SCIF. 

▪ Reliability, Availability, Serviceability (RAS) Support 
  Because SCIF serves as the communication channel between the host and the Intel® Xeon Phi™ coprocessors, it 
  is used for RAS communication. 

▪ Power Management 
  SCIF must deal with power state events such as a node entering or leaving package C6. 

▪ Virtualization Considerations 
  The Intel® Xeon Phi™ coprocessor product supports the direct assignment virtualization model. The host 
  processor is virtualized, and each Intel® Xeon Phi™ coprocessor device is assigned exclusively to exactly one 
  VM. Under this model, each VM and its assigned Intel® Xeon Phi™ coprocessor devices can operate as a SCIF 
  network. Each SCIF network is separate from other SCIF networks in that no intercommunication is possible. 

▪ Multi-card Support 
  The SCIF model, in principle, supports an arbitrary number of Intel® Xeon Phi™ coprocessor devices. The SCIF 
  implementation is optimized for up to 8 Intel® Xeon Phi™ coprocessor devices. 

▪ Board Tools 
  The Intel® MPSS ships with some software tools commonly referred to as "board tools". Some of these board 
  tools are layered on SCIF. 


Because SCIF provides the communication capability between the host and the Intel® Xeon Phi™ coprocessors, there must be 
implementations of SCIF on both the host and the Intel® Xeon Phi™ coprocessor. Multisocket platforms are supported 
by providing each socketed processor with a physical PCI Express* interface. SCIF supports communication between 
any host processor and any Intel® Xeon Phi™ coprocessor, and between any two Intel® Xeon Phi™ coprocessors 
connected through separate physical PCI buses. 


All of the Intel® Xeon Phi™ coprocessor memory can be made visible to the host or to other Intel® Xeon Phi™ coprocessor 
devices. The upper 512 GB of the Intel® Xeon Phi™ coprocessor's physical address space is divided into 32 16-GB ranges 
that map through 32 corresponding SMPT registers to 16-GB ranges in host system address space. Each SMPT register can 
be programmed to any multiple of 16 GB in the host's 64-bit address space. The Intel® Xeon Phi™ coprocessor accesses the 
host's physical memory through these registers. It also uses these registers to access the memory space of other Intel® 
Xeon Phi™ coprocessor devices for peer-to-peer communication, since Intel® Xeon Phi™ coprocessor memory is mapped 
into the host address space. Thus, there is an upper limit of 512 GB to the host system memory space that can be 
addressed at any time. Up to seven SMPT registers (112 GB of this aperture) are needed to access the memory of seven 
other Intel® Xeon Phi™ coprocessor devices in a platform, for a maximum of 8 Intel® Xeon Phi™ coprocessor devices 
(assuming up to 16 GB per Intel® Xeon Phi™ coprocessor device). This leaves 25 SMPT registers, which can map up to 
400 GB of host memory. Overall, as the number of Intel® Xeon Phi™ coprocessor devices within a platform increases, the 
amount of host memory that is visible to each Intel® Xeon Phi™ coprocessor device decreases. 
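

The aperture arithmetic in the preceding paragraph can be checked with a short calculation. The sketch below is only a 
model of the numbers quoted in the text (32 SMPT registers of 16 GB each, one register per peer device); the function 
names are hypothetical and do not correspond to any driver interface. 

```python
GB = 1 << 30
SMPT_REGISTERS = 32        # number of SMPT registers, per the text
SMPT_RANGE = 16 * GB       # size of the range mapped by each register


def host_memory_visible(num_coprocessors, smpt_per_peer=1):
    """Bytes of host memory one coprocessor can map, given the device count.

    Assumes each peer device needs `smpt_per_peer` registers (one register
    covers a peer with up to 16 GB of memory).
    """
    if num_coprocessors < 1:
        raise ValueError("at least one coprocessor is required")
    used = (num_coprocessors - 1) * smpt_per_peer
    if used > SMPT_REGISTERS:
        raise ValueError("not enough SMPT registers for this many peers")
    return (SMPT_REGISTERS - used) * SMPT_RANGE


def smpt_index(aperture_offset):
    """Split an offset into the 512 GB aperture into (register, offset in range)."""
    if not 0 <= aperture_offset < SMPT_REGISTERS * SMPT_RANGE:
        raise ValueError("offset is outside the 512 GB aperture")
    return divmod(aperture_offset, SMPT_RANGE)


for cards in (1, 2, 8):
    print(cards, "device(s):", host_memory_visible(cards) // GB, "GB of host memory")
```

With eight devices the calculation reproduces the 400 GB figure given above; with a single device the full 512 GB 
aperture is available for host memory. 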


Figure 5-1. SCIF Architectural Model 


Note that although SCIF supports peer-to-peer reads, the PCIe* root complex of some Intel client platforms does not. 


The Intel® Xeon Phi™ coprocessor DMA engine begins DMAs on cache-line boundaries, and the DMA length is some 
multiple of the cache-line length (64B). Many applications need finer granularity. SCIF uses various software techniques 
to compensate for this limitation. For example, when the source and destination base addresses are separated by 
a multiple of 64B, but do not begin on a cache-line boundary, the transfer is performed as unaligned "head" and "tail" 
read and write transfers (by the Intel® Xeon Phi™ coprocessor cores) and an aligned DMA "body" transfer. When the 
source and destination base addresses are not separated by a multiple of 64B, SCIF may first perform a local 
memory-to-memory copy of the buffer, followed by the head/body/tail transfer. 
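

The head/body/tail decomposition can be illustrated with a small calculation. This is a sketch of the idea only, under 
the assumption that the split is computed purely from the addresses and the length; it is not the SCIF driver's code, 
and the function name is hypothetical. 

```python
CACHE_LINE = 64  # DMA granularity in bytes, per the text


def split_transfer(src, dst, length):
    """Return (head, body, tail) byte counts for a transfer.

    head and tail are the unaligned pieces copied by the cores; body is the
    cache-line-aligned piece that the DMA engine can move. Returns None when
    src and dst are not separated by a multiple of 64 bytes, in which case a
    local staging copy would be needed first.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if (src - dst) % CACHE_LINE != 0:
        return None
    # Bytes from src up to the next cache-line boundary (0 if already aligned).
    head = min((-src) % CACHE_LINE, length)
    # Largest multiple of 64 bytes that fits after the head.
    body = (length - head) // CACHE_LINE * CACHE_LINE
    tail = length - head - body
    return head, body, tail


print(split_transfer(0x1010, 0x2050, 200))
print(split_transfer(0x1000, 0x2001, 200))
```

Because the two addresses share the same offset within a cache line, aligning the source to a boundary also aligns the 
destination, which is what makes the aligned body transfer possible. 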


A SCIF implementation on a host or Intel® Xeon Phi™ coprocessor device includes both a user mode (Ring 3) library and 
a kernel mode (Ring 0) driver. The user mode (Ring 3) library and kernel mode (Ring 0) driver implementations are 
designed to maximize portability across devices and operating systems. A kernel mode library facilitates accessing SCIF 
facilities from kernel mode. Subsequent subsections briefly describe the major components layered on SCIF. 


The kernel-mode SCIF API is similar to the user mode API and is documented in the Intel® MIC SCIF API Reference 
Manual for Kernel Mode Linux*. Table 5-1 is a snapshot summary of the SCIF APIs. In the table, uSCIF indicates a function 
in the user mode API, and kSCIF indicates a function in the kernel mode API. For complete details of the SCIF API and 
architectural design, please consult the Intel® MIC SCIF API Reference Manual for User Mode Linux*. 


Table 5-1. Summary of SCIF Functions 


Group          Function                      Mode 

Connection     scif_open                     uSCIF/kSCIF 
               scif_close                    uSCIF/kSCIF 
               scif_bind                     uSCIF/kSCIF 
               scif_listen                   uSCIF/kSCIF 
               scif_connect                  uSCIF/kSCIF 
               scif_accept                   uSCIF/kSCIF 

Messaging      scif_send                     uSCIF/kSCIF 
               scif_recv                     uSCIF/kSCIF 

Registration   scif_register                 uSCIF/kSCIF 
               scif_unregister               uSCIF/kSCIF 
               scif_mmap                     uSCIF 
               scif_munmap                   uSCIF 
               scif_pin_pages                kSCIF 
               scif_unpin_pages              kSCIF 
               scif_register_pinned_pages    kSCIF 
               scif_get_pages                kSCIF 
               scif_put_pages                kSCIF 

RMA            scif_readfrom                 uSCIF/kSCIF 
               scif_writeto                  uSCIF/kSCIF 
               scif_vreadfrom                uSCIF/kSCIF 
               scif_vwriteto                 uSCIF/kSCIF 
               scif_fence_mark               uSCIF/kSCIF 
               scif_fence_wait               uSCIF/kSCIF 
               scif_fence_signal             uSCIF/kSCIF 

Utility        scif_event_register           kSCIF 
               scif_poll                     uSCIF/kSCIF 
               scif_get_nodeIDs              uSCIF/kSCIF 
               scif_get_fd                   uSCIF 


The Connection API group enables establishing connections between processes on different nodes in the SCIF network, 
and employs a socket-like connection procedure. Such connections are point-to-point, connecting a pair of processes, 
and are the context in which communication between processes is performed. 


The Messaging API group supports two-sided communication between connected processes and is intended for the 
exchange of short, latency-sensitive messages such as commands and synchronization operations. 


The Registration API group enables controlled access to ranges of the memory of one process by a process to which it is 
connected. This group includes APIs for mapping the registered memory of a process in the address space of another 
process. 


The RMA API group supports one-sided communication between the registered memories of connected processes, and 
is intended for the transfer of medium to large buffers. Both DMA and programmed I/O are supported by this group. 
The RMA API group also supports synchronization to the completion of previously initiated RMAs. 


Utility APIs provide a number of utility services. 


5.2  MicAccessAPI 


The MicAccessAPI is a C/C++ library that exposes a set of APIs for applications to monitor and configure several metrics 
of the Intel® Xeon Phi™ coprocessor platform. It also allows communication with other agents, such as the System 
Management Controller if it is present on the card. This library in turn depends on libscif.so, which is required 
in order to connect to and communicate with the kernel components of the software stack. The libscif.so 
library is installed as part of Intel® MPSS. Several tools, including MicInfo, MicCheck, MicSmc, and MicFlash, all of which 
are located in /opt/intel/mic/bin after installing MPSS, rely heavily on MicAccessAPI. 


Following a successful boot of the Intel® Xeon Phi™ coprocessor card(s), the primary responsibility of MicAccessAPI is to 
establish connections with the host driver and the coprocessor OS, and subsequently to allow software to 
monitor and configure Intel® Xeon Phi™ coprocessor parameters. The host application and coprocessor OS communicate 
using messages, which are sent via the underlying SCIF architecture using the sysfs mechanism as indicated in the figure 
below. 




[Figure: on the host, user applications (MicFlash, MicSMC, MicCheck, MicInfo) sit on top of MicAccessAPI, which uses 
user SCIF, IOCTL, and PCI access/memory mapping to reach the host SCIF driver in the kernel. Across the PCIe bus, on the 
MIC device, the coprocessor SCIF driver connects to the SysMgmt SCIF interface, the coprocessor OS monitoring thread, 
and the coprocessor OS sysfs interface.] 


Figure 5-2. Intel® Xeon Phi™ Coprocessor SysMgmt MicAccessAPI Architecture Components Diagram 


Another important responsibility of MicAccessAPI is to update the flash and the SMC. To perform this 
update, the Intel® Xeon Phi™ coprocessor cards must be in the 'ready' mode. This can be accomplished using the 
'micctrl' tool that comes with MPSS. The MicAccessAPI then enters maintenance mode and interacts with the SPI 
flash and the SMC's flash components via the maintenance mode handler to complete the update process, 
as shown in the figure below. 


Figure 5-3. MicAccessAPI Flash Update Procedure 


The various APIs included in the MicAccessAPI library can be classified into several broad categories as shown in 
Table 5-2. 


Table 5-2. MicAccessAPI Library APIs 


Group                      API Name 

Initialization             MicInitAPI, MicCloseAPI, MicInitAdapter, MicCloseAdapter 

Flash                      MicGetFlashDeviceInfo, MicGetFlashInfo, 
                           MicGetFlashVersion, MicUpdateFlash, MicSaveFlash, 
                           MicWriteFlash, MicAbortFlashUpdate, MicDiffFlash, 
                           MicFlashCompatibility, MicGetMicFlashStatus 

Power management           MicPowerManagementStatus, MicGetPowerUsage, 
                           MicGetPowerLimit, MicPowerManagementEnable, 
                           MicResetMaxPower 

SMC                        MicGetSMCFWVersion, MicGetSMCHWRevision, 
                           MicGetUUID, MicLedAlert 

Thermal                    MicGetFanStatus, MicSetFanStatus, MicGetTemperature, 
                           MicGetFSCInfo, MicGetFrequency, MicGetVoltage 

Memory                     MicGetDevMemInfo, MicGetGDDRMemSize, 
                           MicGetMemoryUtilization, MicMapMemory, 
                           MicUnmapMemory, MicReadMem, MicWriteMem, 
                           MicReadMemPhysical, MicWriteMemPhysical, 
                           MicCopyGDDRToFile 

PCIe                       MicGetPcieLinkSpeed, MicGetPcieLinkWidth, 
                           MicGetPcieMaxPayload, MicGetPcieMaxReadReq 

Core                       MicGetCoreUtilization, MicGetNumCores 

Turbo & ECC                MicGetTurboMode, MicDeviceSupportsTurboMode, 
                           MicEnableTurboMode, MicDisableTurboMode, 
                           MicGetEccMode, MicEnableEcc, MicDisableEcc 

Exception                  MicExceptionsEnableAPI, MicExceptionsDisableAPI, 
                           MicThrowException 

General Card Information   MicGetDeviceID, MicGetPostCode, MicGetProcessorInfo, 
                           MicGetRevisionID, MicGetSiSKU, MicGetSteppingID, 
                           MicGetSubSystemID, MicCheckUOSDownloaded, 
                           MicGetMicVersion, MicGetUsageMode, MicSetUsageMode, 
                           MicCardReset 


5.3 Support for Industry Standards 


The Intel® MPSS supports industry standards like OpenMP™, OpenCL*, MPI, OFED*, and TCP/IP. 
OpenMP™ is supported as part of the Intel® Composer XE software tools suite for the Intel® MIC Architecture. 


MPI standards are supported through OFED* verbs development. See Section 2.2.9.2 for the OFED* support offered in the 
Intel® Xeon Phi™ coprocessor. 


The support for the OpenCL* standard for programming heterogeneous computers consists of three components: 


• Platform APIs used by a host program to enumerate compute resources and their capabilities. 

• A set of Runtime APIs to control compute resources in a platform-independent manner. The Runtime APIs are 
  responsible for memory allocations, copies, and launching kernels; and provide an event mechanism that allows 
  the host to query the status of or wait for the completion of a given call. 

• A C-based programming language for writing programs for the compute devices. 


For more information, consult the relevant specification published by the respective owning organizations: 


• OpenMP™ ( http://openmp.org/ ) 
• OpenCL* ( http://www.khronos.org/opencl/ ) 
• MPI ( http://www.mpi-forum.org/ ) 
• OFED* Overview ( http://www.openfabrics.org ) 


5.3.1 TCP/IP Emulation 


The NetDev drivers emulate an Ethernet device to the next higher layer (IP layer) of the networking stack. Drivers have 
been developed specifically for the Linux* and Windows* operating systems. The host can be configured to bridge the 
TCP/IP network (created by the NetDev drivers) to other networks that the host is connected to. The availability of such 
a TCP/IP capability enables, among other things: 


• remote access to Intel® Xeon Phi™ coprocessor devices via Telnet or SSH 
• access to MPI on TCP/IP (as an alternative to MPI on OFED*) 
• NFS access to the host or remote file systems (see Section 0). 


5.4 Intel® Xeon Phi™ Coprocessor Command Utilities 


Table 5-3 describes the utilities that are available to move data or execute commands or applications from the host to 
the Intel® Xeon Phi™ coprocessors. 


Table 5-3. Intel® Xeon Phi™ Coprocessor Command Utilities 


Utility           Description 

micctrl           Administers various Intel® Xeon Phi™ coprocessor duties, including initializing, resetting, 
                  and changing or setting the modes of any coprocessors installed on the platform. 
                  See the Intel® Xeon Phi™ Manycore Platform Software Stack (MPSS) Getting Started 
                  Guide (document number 513523) for details on how to use this tool. 

micnativeloadex   Uploads an executable and any dependent libraries: 
                  ▪ from the host to a specified Intel® Xeon Phi™ coprocessor device 
                  ▪ from one Intel® Xeon Phi™ coprocessor device back to the host 
                  ▪ from one Intel® Xeon Phi™ coprocessor device to another Intel® Xeon Phi™ 
                    coprocessor device. 
                  A process is created on the target device to execute the code. The micnativeloadex 
                  application can redirect (proxy) the process's file I/O to or from a device on the host. 
                  See the Intel® Xeon Phi™ Manycore Platform Software Stack (MPSS) Getting Started 
                  Guide (document number 513523) for details on how to use this tool. 


5.5 NetDev Virtual Networking 


5.5.1 Introduction 


The Linux* networking stack (see Figure 5-4) is made up of many layers. The application layer at the top consists of 
entities that typically run in ring3 (e.g., FTP client, Telnet, etc.) but can include support from components that run in 
ring0. The ring3 components typically use the services of the protocol layers via a system call interface like sockets. The 
device-agnostic transport layer consists of several protocols, including the two most common ones: TCP and UDP. The 
transport layer is responsible for maintaining peer-to-peer communications between two endpoints (commonly 
identified by ports) on the same or on different nodes. The Network layer (layer 3) includes protocols such as IP, ICMP, 
and ARP; and is responsible for maintaining communication between nodes, including making routing decisions. The Link 
layer (layer 2) consists of a number of protocol-agnostic device drivers that provide access to the Physical layer for a 
number of different mediums such as Ethernet or serial links. In the Linux* network driver model, the Network layer 
talks to the device drivers via an indirection level that provides a common interface for access to various mediums. 


Application Layer: FTP client, sockets 


Transport Layer: TCP/UDP 


Network Layer: IP 


Figure 5-4 Linux* Network Stack 


The focus of this section is to describe the virtual Ethernet driver that is used to communicate between various nodes in 
the system, including between cards. The virtual Ethernet driver sends and receives Ethernet frames across the PCI 
Express* bus and uses the DMA capability provided by the SBox on each card. 


5.5.2 Implementation 


A separate Linux* interface is created for each Intel® Xeon Phi™ coprocessor (mic0, mic1, and so on). It emulates a 
Linux* hardware network driver underneath the network stack on both ends. Currently, the connections are class C 
subnets local to the host system only. In the future, the class C subnets will be made available under the Linux* network 
bridging system for access from outside the host. 


The driver follows these steps during initialization, host transmit, and card transmit: 


1. A descriptor ring is created in host memory. 
2. The host provides receive buffer space in the descriptor ring using Linux* skbuffs. 
3. The card maps the host descriptor ring. 
4. During host transmit, the host posts transmit skbuffs to the card OS in the descriptor ring. 
5. The card polls for changes in the host transmit descriptor ring. 
6. The card allocates an skbuff and copies the host transmit data into it. 
7. The card sends the new skbuff to the card-side TCP/IP stack. 
8. At card transmit, the card copies the transmit skbuff to a receive buffer provided at initialization. 
9. The card increments the descriptor pointer. 
10. The host polls for changes in the transmit ring. 
11. The host sends the receive buffer to the TCP/IP stack. 
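

The polled descriptor-ring handoff described in these steps can be modelled in a few lines. The sketch below is a 
simplified, single-producer/single-consumer model written for illustration; the class and method names are hypothetical, 
byte strings stand in for skbuffs, and none of it is taken from the NetDev driver source. 

```python
class DescriptorRing:
    """Minimal polled ring: one side posts frames, the other polls and copies."""

    def __init__(self, slots):
        if slots < 2:
            raise ValueError("ring needs at least two slots")
        self.slots = [None] * slots
        self.head = 0   # next slot the producer writes
        self.tail = 0   # next slot the consumer reads

    def post(self, frame):
        """Producer side: place a frame in the ring; False if the ring is full."""
        nxt = (self.head + 1) % len(self.slots)
        if nxt == self.tail:
            return False            # one slot stays empty to tell full from empty
        self.slots[self.head] = frame
        self.head = nxt             # advancing the pointer publishes the frame
        return True

    def poll(self):
        """Consumer side: copy out every frame posted since the last poll."""
        frames = []
        while self.tail != self.head:
            frames.append(bytes(self.slots[self.tail]))   # copy into a new buffer
            self.slots[self.tail] = None
            self.tail = (self.tail + 1) % len(self.slots)
        return frames


# Host transmit: the host posts frames, the card polls and hands them to its stack.
host_tx = DescriptorRing(slots=4)
host_tx.post(b"frame-a")
host_tx.post(b"frame-b")
print(host_tx.poll())
```

Polling keeps the model simple but costs CPU time on the consumer side, which is the motivation for an 
interrupt-driven design. 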


In a future implementation, the host will create a descriptor ring for controlling transfers during initialization. The 
host will allocate and post a number of receive buffers to the card, and the card will allocate and post a number of 
receive buffers to the host. At host transmit, the host DMAs data to a receive skbuff posted by the Intel® Xeon Phi™ 
coprocessor and interrupts the card; the card interrupt routine sends the skbuff to the TCP/IP stack, and the card 
allocates and posts a new empty buffer for host use. At card transmit, the card DMAs data to a receive skbuff posted by 
the host and interrupts the host; the host interrupt routine sends the skbuff to the TCP/IP stack, and the host 
allocates and posts a new empty buffer for card use. 


6 Compute Modes and Usage Models 


The architecture of the Intel® Xeon Phi™ coprocessor enables a wide continuum of compute paradigms far beyond what 
is currently available. This flexibility allows a dynamic range of solutions to address your computing needs, from highly 
scalar processing to highly parallel processing, and a combination of both in between. There are three general categories 
of compute modes supported with the Intel® Xeon Phi™ coprocessor, which can be combined to develop applications 
that are optimal for the problem at hand. 


6.1 Usage Models 


The following two diagrams illustrate the compute spectrum enabled and supported by the coupling of the Intel® Xeon® 
processor and the Intel® Xeon Phi™ coprocessor. Depending on the application's compute needs, a portion of its compute 
processes can be processed either by the Intel® Xeon® processor host CPUs or by the Intel® Xeon Phi™ coprocessor. The 
application can also be started or hosted by either the Intel® Xeon® processor host or by the Intel® Xeon Phi™ 
coprocessor. Depending on the computational load, an application will run within the range of this spectrum for optimal 
performance. 


[Figure: a continuum of five modes, from Xeon® hosted (general-purpose codes) through scalar co-processing (codes with 
highly parallel phases), symmetric (codes with balanced needs), and parallel co-processing (highly parallel codes with 
scalar phases) to MIC hosted (highly parallel codes).] 


Figure 6-1: A Scalar/Parallel Code Viewpoint of the Intel® MIC Architecture Enabled Compute Continuum 


[Figure: process placement across the continuum: Xeon® native; Xeon® hosted, MIC co-processed; autonomous mode; 
MIC hosted, Xeon® co-processed; MIC native.] 


Figure 6-2: A Process Viewpoint of the Intel® MIC Architecture Enabled Compute Continuum 


6.2 MPI Programming Models 


The Intel® MPI Library for Intel® MIC Architecture plans to provide all of the traditional Intel® MPI Library features on 
any combination of the Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessors. The intention is to extend the set 
of architectures supported by the Intel® MPI Library for the Linux* OS, thus providing a uniform program development 
and execution environment across all supported platforms. 


The Intel® MPI Library for Linux* OS is a multi-fabric message-passing library based on ANL* MPICH2* and OSU* 
MVAPICH2*. The Intel® MPI Library for Linux* OS implements the Message Passing Interface, version 2.1 (MPI-2.1) 
specification. 


The Intel® MPI Library for Intel® MIC Architecture supports the programming models shown in Figure 6-3. 


[Figure: two families of programming models. Offload: the Intel® MIC Architecture or the host CPU acts as an accelerator. 
MPI: MPI ranks run on several coprocessors and/or host nodes, with messages to and from any core. 

  Offload (direct acceleration): MPI ranks on the host CPU only; messages into/out of the host CPU; Intel® MIC 
  Architecture as an accelerator. 
  Offload (reverse acceleration): MPI ranks on the MIC CPU only; messages into/out of the MIC CPU; host CPU as an 
  accelerator. 
  Coprocessor-only: MPI ranks on the MIC CPU only; messages into/out of the MIC CPU c/o host CPUs; threading possible. 
  Symmetric: MPI ranks on the MIC and host CPUs; messages into/out of the MIC and host CPUs; threading possible.] 


Figure 6-3: MPI Programming Models for the Intel® Xeon Phi™ Coprocessor 


In the Offload mode, either the Intel® Xeon Phi™ coprocessors or the host CPUs are considered to be coprocessors. 
There are two possible scenarios: 


1. Xeon® hosted with Intel® Xeon Phi™ coprocessors, where the MPI processes run on the host Xeon® CPUs, while 
   the offload is directed towards the Intel® Xeon Phi™ coprocessors. This model is supported by the Intel® MPI 
   Library for Linux* OS as of version 4.0 Update 3. 

2. Intel® Xeon Phi™ coprocessor hosted with Xeon® coprocessing, where the MPI processes run on the Intel® Xeon 
   Phi™ coprocessors while the offload is directed to the host Xeon® CPU. 


Both models make use of the offload capabilities of products such as the Intel® C, C++, and Fortran Compilers for Intel® 
MIC Architecture and the Intel® Math Kernel Library (MKL). The second scenario is not currently supported because the 
aforementioned products lack the respective offload capabilities. 


In the MPI mode, the host Xeon® CPUs and the Intel® Xeon Phi™ coprocessors are considered to be peer nodes, so that 
the MPI processes may reside on both or either of the host Xeon® CPUs and Intel® Xeon Phi™ coprocessors in any 
combination. There are three major models: 


• Symmetric model 
  The MPI processes reside on both the host and the Intel® Xeon Phi™ coprocessors. This is the most general MPI 
  view of an essentially heterogeneous cluster. 

• Coprocessor-only model 
  All MPI processes reside only on the Intel® Xeon Phi™ coprocessors. This can be seen as a specific case of the 
  symmetric model previously described. Also, this model has a certain affinity to the Intel® Xeon Phi™ 
  coprocessor hosted with Xeon® coprocessing model because the host CPUs may, in principle, be used for offload 
  tasks. 

• Host-only model (not depicted) 
  All MPI processes reside on the host CPUs and the presence of the Intel® Xeon Phi™ coprocessors is basically 
  ignored. Again, this is a specific case of the symmetric model. It has a certain affinity to the Xeon® hosted with 
  Intel® MIC Architecture model, since the Intel® Xeon Phi™ coprocessors can in principle be used for offload. This 
  model is already supported by the Intel® MPI Library as of version 4.0.3. 


6.2.1 Offload Model 


This model is characterized by the MPI communications taking place only between the host processors. The 
coprocessors are used exclusively through the offload capabilities of products such as the Intel® C, C++, and Fortran 
Compilers for Intel® MIC Architecture, the Intel® Math Kernel Library (MKL), etc. This mode of operation is already 
supported by the Intel® MPI Library for Linux* OS as of version 4.0 Update 3. Using MPI calls inside offloaded code is not 
supported. 


Note that the total size of the offload code and data is limited to 85% of the amount of GDDR memory on 
the coprocessor. 


Figure 6-4. MPI on Host Devices with Offload to Coprocessors 


6.2.2 Coprocessor-Only Model 


In this model (also known as the "Intel® MIC architecture native" model), the MPI processes reside solely inside the 
coprocessor. MPI libraries, the application, and other needed libraries are uploaded to the coprocessors. Then an 
application can be launched from the host or from the coprocessor. 


Figure 6-5: MPI on the Intel® Xeon Phi™ Coprocessors Only 


6.2.3 Symmetric Model 


In this model, the host CPUs and the coprocessors are involved in the execution of the MPI processes and the related 
MPI communications. Message passing is supported inside the coprocessor, inside the host node, and between the 
coprocessor and the host via the shm and shm:tcp fabrics. The shm:tcp fabric is chosen by default; however, using shm 
for communication between the coprocessor and the host provides better MPI performance than TCP. To enable shm 
for internode communication, set the environment variable I_MPI_SSHM_SCIF={enable|yes|on|1}. 


Figure 6-6: MPI Processes on Both the Intel® Xeon® Nodes and the Intel® MIC Architecture Devices 


The following is an example of the symmetric model. Here mpiexec.hydra is started on the host and launches 4 processes 
on the host with 4 threads in each process, and 2 processes on the "mic0" coprocessor with 16 threads in each process: 


(host)$ mpiexec.hydra -host $(hostname) -n 4 -env OMP_NUM_THREADS 4 ./test.exe.host : \ 
            -host mic0 -n 2 -env OMP_NUM_THREADS 16 -wdir /tmp /tmp/test.exe.mic 
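

When the host and coprocessor groups vary from run to run, a command line of this shape can be assembled by a small 
helper. The sketch below is a hypothetical convenience script, not part of the Intel® MPI Library; it uses only the 
options that appear in the example above (-host, -n, -env, -wdir, and the ":" group separator), and the host name 
"host0" is a placeholder. 

```python
def hydra_command(groups):
    """Build an mpiexec.hydra argument list from per-node group descriptions.

    Each group is a dict with the keys host, nprocs, omp_threads, exe and,
    optionally, wdir. Groups are joined with ":" as in the example above.
    """
    argv = ["mpiexec.hydra"]
    for index, group in enumerate(groups):
        if index:
            argv.append(":")
        argv += ["-host", group["host"], "-n", str(group["nprocs"]),
                 "-env", "OMP_NUM_THREADS", str(group["omp_threads"])]
        if "wdir" in group:
            argv += ["-wdir", group["wdir"]]
        argv.append(group["exe"])
    return argv


def total_threads(groups):
    """Total OpenMP threads across all MPI processes in all groups."""
    return sum(g["nprocs"] * g["omp_threads"] for g in groups)


groups = [
    {"host": "host0", "nprocs": 4, "omp_threads": 4, "exe": "./test.exe.host"},
    {"host": "mic0", "nprocs": 2, "omp_threads": 16,
     "wdir": "/tmp", "exe": "/tmp/test.exe.mic"},
]
print(" ".join(hydra_command(groups)))
print("threads in total:", total_threads(groups))
```

Keeping the thread count next to the process count makes it easier to check that each node is neither oversubscribed 
nor left idle. 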


6.2.4 Feature Summary 


The Intel® MPI Library requires the presence of the /dev/shm device in the system. To avoid failures related to the 
inability to create a shared memory segment, the /dev/shm device must be set up correctly. 


Message passing is supported inside the coprocessor, inside the host node, between the coprocessors, and between the 
coprocessor and the host via the shm and shm:tcp fabrics. The shm:tcp fabric is chosen by default. 


The Intel® MPI Library pins processes automatically. The environment variable I_MPI_PIN and related variables are used 
to control process pinning. The number of MPI processes is limited only by the available resources. The memory 
limitation may manifest itself as an 'lseek' or 'cannot register the bufs' error in an MPI application. Setting the 
environment variable I_MPI_SSHM_BUFFER_SIZE to a value smaller than 64 KB may work around this issue. 


The current release of the Intel® MPI Library for Intel® MIC Architecture for Linux* OS does not support certain parts of 
the MPI-2.1 standard specification: 


• Dynamic process management 
• MPI file I/O 
• Passive target one-sided communication when the target process does not call any MPI functions 


The current release of the Intel® MPI Library for Intel® MIC Architecture for Linux* OS also does not support certain 
features of the Intel® MPI Library 4.0 Update 3 for Linux* OS: 


• ILP64 mode 
• gcc support 
• IPM Statistics 
• Automatic Tuning Utility 
• Fault Tolerance 
• mpiexec -perhost option 


6.2.5 MPI Application Compilation and Execution 


The typical steps of compiling an MPI application and executing it using mpiexec.hydra are canonically shown in Figure 
6-7. 


$ mpiicc [-mmic] app.c -o app[.mic] 
Build Intel® 64 and Intel® MIC Architecture binaries by using the respective
compilers targeting Intel® 64 and Intel® MIC Architecture.


Upload the binaries to Intel® MIC Architecture (unless NFS mounted). 


$ mpiexec.hydra -n 40 -f hostfile app 
Run 40 instances of application on different mixed nodes. 


Figure 6-7. Compiling and Executing an MPI Application


For detailed information about installing and running the Intel® MPI Library for Intel® MIC Architecture with the Intel® Xeon
Phi™ coprocessors, please see the Intel® Xeon Phi™ Coprocessor DEVELOPER'S QUICK START GUIDE.


7 Intel® Xeon Phi™ Coprocessor Vector Architecture


7.1 Overview 


The Intel® Xeon Phi™ coprocessor includes a new vector processing unit (VPU) with a new SIMD instruction set. These
new instructions do not support prior vector architecture models like MMX™, Intel® SSE, or Intel® AVX.


The Intel® Xeon Phi™ coprocessor VPU is a 16-wide floating-point/integer SIMD engine. It is designed to operate
efficiently on SOA (Structure of Arrays) data, i.e. [x0, x1, x2, x3, ..., x15], [y0, y1, y2, y3, ..., y15], [z0, z1, z2, z3, ..., z15],
and [w0, w1, w2, w3, ..., w15], as opposed to AOS (Array of Structures) data, i.e. [x0, y0, z0, w0], [x1, y1, z1, w1],
[x2, y2, z2, w2], [x3, y3, z3, w3], ..., [x15, y15, z15, w15].
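

As an illustration of the two layouts, the following Python sketch transposes an AOS list of four-component points into
SOA form. The point data and the helper name aos_to_soa are hypothetical and are not part of any Intel tool; the sketch
only shows why SOA places sixteen like-typed values next to each other, which is the shape a 16-wide vector operation
consumes.

```python
# Hypothetical 4-component points stored as an Array of Structures (AOS).
aos = [(float(i), float(i) + 0.25, float(i) + 0.5, 1.0) for i in range(16)]

def aos_to_soa(points):
    """Transpose [(x, y, z, w), ...] into four per-component lists."""
    xs, ys, zs, ws = (list(component) for component in zip(*points))
    return xs, ys, zs, ws

# Each resulting list holds 16 values of one component, i.e. one full
# vector of sixteen 32-bit elements in the layout described above.
xs, ys, zs, ws = aos_to_soa(aos)
```

With SOA data, one vector operation on xs processes the x component of all sixteen points at once.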


7.2 Vector State 


The VPU brings with it a new architectural state, comprising vector general registers, vector mask registers, and a status
register known as VXCSR, which mimics the Intel® SSE status register MXCSR in behavior. The new VPU architectural
state is replicated four times, once for each hardware context in each core. The Intel® Xeon Phi™ coprocessor introduces
32 new vector registers, zmm0 through zmm31. Each vector register is sixty-four bytes wide (512 bits). The primary use
of a vector register is to operate on a collection of sixteen 32-bit elements, or eight 64-bit elements. Figure 7-1 shows the
new architectural state associated with the VPU.


[Figure 7-1 diagram: 32 vector registers (V0 through V31), each 512 bits wide; 8 vector mask registers (K0 through K7),
each 16 bits wide; and the extended VXCSR status register.]


Figure 7-1: VPU Registers 


The Intel® Xeon Phi™ coprocessor also introduces eight vector mask registers, denoted with a k prefix; that is, k0
through k7. Each mask register is sixteen bits wide and is used in a variety of ways, including write masking,
carry/borrow flags, comparison results, and more.
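

The register geometry described above can be checked with a small Python sketch. The names ZMM_BYTES and
active_elements are assumptions made for illustration. The struct module is used only to show that sixteen 32-bit
values or eight 64-bit values both fill exactly 64 bytes, and that a 16-bit mask carries one bit per 32-bit element.

```python
import struct

ZMM_BYTES = 64  # one vector register: 512 bits

# Sixteen single-precision values and eight double-precision values
# both occupy exactly one 64-byte vector register.
as_f32 = struct.pack("<16f", *[float(i) for i in range(16)])
as_f64 = struct.pack("<8d", *[float(i) for i in range(8)])

def active_elements(mask):
    """Return the indices of the 32-bit elements selected by a 16-bit mask."""
    return [i for i in range(16) if (mask >> i) & 1]
```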


7.3 VPU Basic Functions 


The VPU is a fully pipelined SIMD vector engine that executes Intel® Xeon Phi™ coprocessor vector
instructions at a throughput of one instruction per cycle for most operations. (If there are more than 2 threads working
on a core, the decoder cannot take an instruction from one thread at each cycle because of back-to-back issues.) Most
VPU instructions use fully pipelined Multiply-Add (MADD) resources and are issued from the core U-pipe. Vector stores
(except for scatter), mask operations, and a subset of integer instructions have duplicated or dedicated resources. Most
instructions can also be issued from the core V-pipe to be paired with vector instructions issued down the U-pipe for
execution in the VPU.


The VPU does not have any logic to handle stall or flush conditions, but it does contain logic to check for vector-register
and mask-register dependencies. Dependency detection logic in the VPU sends a stall request to the core if the operands
are not data-ready (freeze). Most vector instructions go through the vector pipeline exactly once, and their execution
completes in 6 cycles, namely VC1, VC2, V1, V2, V3, and V4.


The VPU reads/writes the DCache at cache-line granularity via a dedicated 512-bit bus; this allows the VPU to perform 
one read from and one write to memory per clock cycle. Reads from DCache go through load conversion and generic 
swizzling mux before getting sent to the ALUs. Writes go through store conversion and alignments before getting sent to 
the Store Commit Buffers which reside in the DCache. For register operands, the VPU uses data stored in the vector 
register file exclusively. 


The major structures in the VPU are the Vector Register File (VRF), the Flag Registers (MRF), the Swizzle 
Muxes (SWZ), the SRGB Lookup Table (SLUT), and the Trans Lookup Table (TLUT). 


7.4 VPU Data Types 


The VPU contains 16 SP ALUs and 8 DP ALUs. It has a 32-bit mantissa data-path for the SP ALUs, a 54-bit mantissa
data-path for the DP ALUs, and a 512-bit data-path for loads and stores. Data types are converted to and from 32-bit or
64-bit representation before and after execution respectively.


The VPU instructions support the following native data types:


• Packed 32-bit Integers (or dword)
• Packed 32-bit SP FP values
• Packed 64-bit Integers (or qword)
• Packed 64-bit DP FP values


The VPU instructions can be categorized into typeless 32-bit instructions (denoted by postfix "d"),
typeless 64-bit instructions (denoted by postfix "q"), signed and unsigned int32 instructions (denoted by postfix "pi" and
"pu" respectively), signed int64 instructions (denoted by postfix "pq"), and fp32 and fp64 instructions (denoted by
postfix "ps" and "pd" respectively).


For arithmetic calculations, the VPU represents values internally using 32-bit or 64-bit two's complement plus a sign bit 
(duplicate of the MSB) for signed integers and 32-bit or 64-bit plus a sign bit tied to zero for unsigned integers. This 
simplifies the integer datapath and eliminates the need to implement multiple paths for the integer arithmetic. The VPU 
represents floating-point values internally using signed-magnitude with exponent bias of 128 or 1024 to adhere to the 
IEEE basic single-precision or double-precision format. 


The VPU supports the up-conversion/down-conversion of the data types to/from either 32-bit or 64-bit values in order
to execute instructions in SP ALU or DP ALU (see Table 7-1).


Table 7-1 Bidirectional Up/Down Conversion Table


Data Type    Float32    Float64

SRGB8        Yes        No
Float10      Yes        No
Float11      Yes        No
Float16      Yes        No
Unorm2       Yes        No
Unorm10      Yes        No
Int32        Yes        Yes
Uint32       Yes        Yes
Float64      Yes        No


7.5 Extended Math Unit 


The Extended Math Unit (EMU) is added to provide hardware support for fast transcendental approximations using Lookup
Tables. Minimax quadratic polynomial approximation is used to compute single-precision FP transcendental functions. The
design uses Lookup Tables to store the coefficients C0, C1, and C2, adds a special truncated squarer, and modifies the existing
Wallace tree in the multiplier to accommodate the calculation of each approximation. The goal of the EMU is to provide
1-cycle or 2-cycle throughput transcendental functions. Specifically, the hardware will provide elementary functions:
reciprocal (1/X, recip), reciprocal square root (1/√X, rsqrt), base 2 exponential (2^X, exp2), and logarithm base 2 (log2).
Other transcendental functions can be derived from elementary functions: division (div) using recip and multiplication
(mult), square root (sqrt) using rsqrt and mult, exponential base 10 (exp) using exp2 and mult, logarithm base B (logB)
using log2 and mult, and power (pow) using log2, mult, and exp2. Table 7-2 shows the projected latency (in terms of
number of instructions) of elementary and derived functions.


Table 7-2 Throughput Cycle of Transcendental Functions 


Elementary Functions    Latency    |    Derived Functions    Latency

RECIP                   1          |    DIV                  2
RSQRT                   1          |    SQRT                 2
EXP2                    2          |    POW                  4
LOG2                    1          |
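

The way the derived functions are composed from the four elementary functions can be sketched in Python. This is only
an illustration of the algebra: the elementary functions below are stand-ins that call the standard math library, not the
table-driven hardware approximations, and all function names are hypothetical.

```python
import math

# Stand-ins for the four elementary EMU functions (library math here,
# not the hardware's lookup-table approximations).
def recip(x): return 1.0 / x
def rsqrt(x): return 1.0 / math.sqrt(x)
def exp2(x): return 2.0 ** x
def log2(x): return math.log2(x)

# Derived functions, each built from elementary functions plus multiplies.
def derived_div(a, b):          # a / b = a * recip(b)
    return a * recip(b)

def derived_sqrt(x):            # sqrt(x) = x * rsqrt(x), for x > 0
    return x * rsqrt(x)

def derived_log(x, base):       # logB(x) = log2(x) * (1 / log2(B))
    return log2(x) * recip(log2(base))

def derived_pow(x, y):          # x^y = exp2(y * log2(x)), for x > 0
    return exp2(y * log2(x))
```

The number of elementary and multiply steps in each derived function mirrors the higher latencies listed for DIV,
SQRT, and POW in Table 7-2.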


The EMU is a fully-pipelined design. Lookup Table access happens in parallel with squarer computation. Polynomial
terms are computed and accumulated in a single pass through the Wallace tree, without resource contention. The exponent
goes through special setup logic and is then combined with the mantissa path polynomial evaluation. The resulting
exponent and mantissa then flow through the rest of the logic for normalization and rounding.


7.6 SP FP Operations 


IEEE754r requires basic arithmetic operations (add, subtract, multiply) to be accurate to 0.5 ULP (Unit in the Last Place). 
The Intel® Xeon Phi™ coprocessor SP FP hardware achieves 0.5 ULP for SP FP add, subtract, and multiply operations. 


OpenCL* requires per-instruction rounding mode. The Intel® Xeon Phi™ coprocessor provides support for per-instruction
rounding mode only for register-register arithmetic instructions. This is accomplished by setting the NT bit and using
on-the-fly swizzle control bits as per-instruction rounding mode control bits. For details please see the Intel® Xeon Phi™
Coprocessor Instruction Set Reference Manual (Reference Number: 327364).


DX10/11 requires basic arithmetic operations (add, subtract, multiply) to be accurate to 1 ULP and, as previously
indicated, the Intel® Xeon Phi™ coprocessor SP FP hardware is capable of providing results that are accurate to 0.5 ULP
for SP FP add, subtract, multiply, and FMA.


7.7 DP FP Operations 


IEEE754r requires basic arithmetic operations (add, subtract, multiply) to be accurate to 0.5 ULP. The Intel® Xeon Phi™
coprocessor DP FP hardware is capable of providing results that are accurate to 0.5 ULP for DP FP add, subtract, and multiply.
IEEE754r requires four rounding modes, namely, "Round to Nearest Even", "Round toward 0", "Round toward +∞", and
"Round toward -∞". The Intel® Xeon Phi™ coprocessor supports "Round to Nearest Even", "Round toward 0", "Round
toward +∞", and "Round toward -∞" for DP FP operations via VXCSR. In addition, a new instruction VROUNDPD is added
to allow you to set rounding mode on the fly.


To meet the OpenCL* per-instruction rounding mode requirement, support is provided for per-instruction rounding mode
for register-register arithmetic instructions. This is accomplished by setting the NT bit and using on-the-fly swizzle
control bits as per-instruction rounding mode control bits. For details please see the Intel® Xeon Phi™ Coprocessor
Instruction Set Reference Manual (Reference Number: 327364).


The Intel® Xeon Phi™ coprocessor also supports denormal numbers for DP FP in hardware for all float64 arithmetic
instructions. Denormal number support includes input range checking and output data normalization.


7.8 Vector ISA Overview 


Historically, SIMD implementations have a common set of semantic operations such as add, subtract, multiply, and so
forth. Where most SIMD implementations differ is in the specific number of operands to an operator, the nature of
less common operations such as data permutations, and the treatment of individual elements contained inside a vector
register.


Like Intel's AVX extensions, the Intel® Xeon Phi™ coprocessor uses a three-operand form for its vector SIMD instruction
set. For any generic instruction operator, denoted by vop, the corresponding Intel® Xeon Phi™ coprocessor instruction
would commonly be:


vop zmm1, zmm2, zmm3


where zmm1, zmm2, and zmm3 are vector registers, and vop is the operation (add, subtract, etc.) to perform on them. The
resulting expression¹ would be:


zmm1 = zmm2 vop zmm3


Given that the Intel Architecture is a CISC design, the Intel® Xeon Phi™ coprocessor allows the second source operand
to be a memory reference, thereby creating an implicit memory load operation in addition to the vector operation. The
generic representation of using such a memory source is shown as:


vop zmm1, zmm2, [ptr]
zmm1 = zmm2 vop MEM[ptr]


Any memory reference in the Intel® Xeon Phi™ coprocessor instruction set conforms to standard Intel Architecture
conventions, so it can be a direct pointer reference ([rax]) or an indirect ([rbp] + [rcx]); and can include an immediate
offset, scale, or both² in either direct or indirect addressing form.


1 This is true for two-operand operators, such as arithmetic + or ×. For those operators that require additional operands, such as
carry-propagate instructions or fused multiply-add, a different form is used.
2 e.g., an address could be of the form [rbp]+([rax]*2)+0xA43C0000.


While these basics are relatively straightforward and universal, the Intel® Xeon Phi™ coprocessor introduces new
operations to the vector instruction set in the form of modifiers. The mask registers can be understood as one type of
modifier, where most vector operations take a mask register to use as a write-mask of the result:


vop zmm1 {k1}, zmm2, zmm3/ptr


In the above expression, the specifier k1 indicates that vector mask register number one is an additional source to this
operation. The mask register is specified inside curly brackets {}, which indicates that the mask register is used as a
write-mask register. If the vector register has COUNT elements inside it, then the interpretation of the write-mask
behavior could be considered as:


for (i = 0; i < COUNT; i++)
{
    if (k1[i] == 1)
        zmm1[i] = zmm2[i] vop zmm3/MEM[ptr][i];
}


The key observation here is that the write-mask is a non-destructive modification to the destination register; that is,
where the write-mask has the value 0, no modification of the vector register's corresponding element occurs in that
position. Where the mask has the value 1, the corresponding element of the destination vector register is replaced with
the result of the operation indicated by vop. Write-masking behavior is explored in more detail in Section 7.10.
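

The non-destructive write-mask behavior can be modeled in a few lines of Python. This is a sketch of the semantics only,
not of the hardware; the function name masked_vop is hypothetical, and the default mask of 0xFFFF reflects the default
described in Section 7.10 for instructions that do not name a mask register.

```python
import operator

def masked_vop(dest, src2, src3, op, k=0xFFFF):
    """Apply op element-wise; keep dest[i] wherever bit i of mask k is 0."""
    return [op(a, b) if (k >> i) & 1 else d
            for i, (d, a, b) in enumerate(zip(dest, src2, src3))]

zmm1 = [-1.0] * 16                       # previous destination contents
zmm2 = [float(i) for i in range(16)]
zmm3 = [10.0] * 16

# Only elements 0..3 are enabled; elements 4..15 keep their old values.
result = masked_vop(zmm1, zmm2, zmm3, operator.add, k=0x000F)
```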


Another modifier argument that may be specified on most SIMD vector operations is a swizzle, although the specific
swizzle behavior is determined by whether the arguments are from registers or memory. The first type of swizzle is only
permitted when all operands to the vector operation are registers:


vop zmm1 [{k1}], zmm2, zmm3 [{swizzle}]


Here, square brackets [ ] denote that the write-mask and the swizzle are optional modifiers of the instruction. The
actual meaning of the swizzle, also denoted with curly brackets { } (just as write-masks are denoted), is explained in
depth in Section 7.11. Conceptually, an optional swizzle modifier causes the second source argument to be modified via
a data pattern shuffle for the duration of this one instruction. It does not modify the contents of the second source
register; it only makes a temporary copy and modifies the temporary copy. The temporary copy is discarded at the end
of the instruction.


The swizzle modifier that the Intel® Xeon Phi™ coprocessor supports has an alternate form when used with the implicit
load form. In this form, the swizzle acts as a broadcast modifier of the value loaded from memory. This means that a
subset of memory may be read and then replicated for the entire width of the vector architecture. This can be useful for
vector expansion of a scalar, for repeating pixel values, or for common mathematical operations.
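

The broadcast form can be pictured with the following Python sketch, in which a small number of loaded values is
replicated across a 16-element vector before the operation is applied. The helper names broadcast and vmul_broadcast
are hypothetical; the sketch models only the replication, not the instruction encoding.

```python
def broadcast(loaded, width=16):
    """Replicate the loaded elements until they fill a vector of `width` elements."""
    if width % len(loaded) != 0:
        raise ValueError("vector width must be a multiple of the loaded element count")
    return list(loaded) * (width // len(loaded))

def vmul_broadcast(src, memory_value):
    """Multiply every element of src by one scalar that was read from memory."""
    return [a * b for a, b in zip(src, broadcast([memory_value]))]
```

For example, broadcasting a single value models the expansion of a scalar to all sixteen elements, and broadcasting
four values models repeating one four-component pixel across the vector.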


One subtle aspect of the Intel® Xeon Phi™ coprocessor design is that each vector register is treated as though it entirely
contains either 32-bit or 64-bit elements. Figure 7-2 depicts the organization of the vector register when working with
32-bit data.


One element, 32b in size 


Figure 7-2. Vector Organization When Operating on 16 Elements of 32-bit Data 


When executing an Intel® Xeon Phi™ coprocessor vector instruction, all arithmetic operations are carried out at either
32-bit or 64-bit granularity. This means that, when manipulating data of a different native size such as a two-byte float
called float16, a different mathematical result might be obtained than if the operation were carried out with native
float16 hardware. This can cause bit-differences between an expected result and the actual result, triggering
violations of commutativity or associativity rules.
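

A small Python sketch shows how such bit-differences can arise. It emulates half-precision rounding in software with the
struct module's "e" format; the helper to_f16 and the chosen values are assumptions for illustration, and the sketch does
not model the coprocessor's conversion hardware.

```python
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

a, b, c = 2048.0, 1.0, 1.0   # at 2048, adjacent float16 values are 2 apart

# Native float16 arithmetic would round after every operation.
native_left = to_f16(to_f16(a + b) + c)     # (a + b) + c
native_right = to_f16(a + to_f16(b + c))    # a + (b + c)

# Wider arithmetic keeps full precision and rounds once, on conversion back.
wide = to_f16(a + b + c)
```

Here the step-by-step float16 result depends on the order of the additions, and one of the two orderings differs from
the result computed at wider precision.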


The Intel® Xeon Phi™ coprocessor also includes IEEE 754-2008 (Institute of Electrical and Electronics Engineers Standard for
Floating Point Arithmetic, 2008)-compliant fused multiply-add (FMA) and fused multiply-subtract (FMS) instructions.
These instructions produce results that are accurate to 0.5 ulp³ (one rounding), as compared to separate back-to-back
multiply and add instructions, or the "fused" multiply-add instructions of other architectures, which produce
results accurate to 1.0 ulp (two roundings). In the case of the Intel® Xeon Phi™ coprocessor's fused instructions, the basic
three-operand instruction form is interpreted slightly differently:


zmm1 = zmm1 vop1 zmm2 vop2 zmm3


FMA operations, for example, may have vop1 set to × and vop2 set to +. The reality is richer than this. As mentioned
previously, the ability to perform implicit memory loads for the second source argument zmm3, or to apply swizzles,
conversions, or broadcasts to the second source argument, allows a wider range of instruction possibilities. In the
presence of FMA and FMS, however, the restriction that these modifiers apply only to the second source may lead to
cumbersome workarounds to place the desired source field as the second source in the instruction.


Therefore, the Intel® Xeon Phi™ coprocessor instruction set provides a series of FMA and FMS operations, each one
numbered with a sequence of three digits that indicates the order in which the operand fields are interpreted. This allows
you to use the modifiers without knowing the particulars of the features. For example, the FMA operation for 32-bit
floating-point data comes with these variants: vfmadd132ps, vfmadd213ps, and vfmadd231ps. The logical interpretation⁴ is
seen from the numeric string embedded in each mnemonic:


vfmadd132ps zmm1, zmm2, zmm3 : zmm1 = zmm1 × zmm3 + zmm2
vfmadd213ps zmm1, zmm2, zmm3 : zmm1 = zmm2 × zmm1 + zmm3
vfmadd231ps zmm1, zmm2, zmm3 : zmm1 = zmm2 × zmm3 + zmm1


Memory loads and modifiers such as swizzle, conversion, or broadcast are only applicable to the zmm3 term. By selecting a
mnemonic, you can apply the modifiers to different locations in the functional expression.
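

The operand orderings, and the difference between one rounding and two, can be sketched in Python. The helper fused
computes the product and sum exactly with the fractions module and rounds once; this emulates the single-rounding
property in double precision for illustration and is not the coprocessor's single-precision datapath. The Python function
names are hypothetical and simply echo the digit strings of the mnemonics.

```python
from fractions import Fraction

def fused(a, b, c):
    """Compute a*b + c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def vfmadd132(z1, z2, z3):
    return [fused(a, c, b) for a, b, c in zip(z1, z2, z3)]   # z1 * z3 + z2

def vfmadd213(z1, z2, z3):
    return [fused(b, a, c) for a, b, c in zip(z1, z2, z3)]   # z2 * z1 + z3

def vfmadd231(z1, z2, z3):
    return [fused(b, c, a) for a, b, c in zip(z1, z2, z3)]   # z2 * z3 + z1

# One rounding versus two: the exact product 1 - 2**-54 is not a double,
# so a separate multiply rounds it to 1.0 before the add sees it.
x, y = 1.0 + 2.0 ** -27, 1.0 - 2.0 ** -27
two_roundings = x * y - 1.0
one_rounding = fused(x, y, -1.0)
```

In every variant the third operand is the only one that may come from memory or carry a modifier, so choosing among
the three orderings decides whether the modified value acts as a multiplier or as the addend.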


The Intel® Xeon Phi™ coprocessor also introduces a special fused multiply-add operation that acts as a scale and bias
transformation in one instruction: vfmadd233ps. The interpretation of this instruction is best summarized in a series
of equations. So, the vfmadd233ps of the form vfmadd233ps z, u, v generates the following:


z[3..0]   = u[3..0]   × v[1]  + v[0]
z[7..4]   = u[7..4]   × v[5]  + v[4]
z[11..8]  = u[11..8]  × v[9]  + v[8]
z[15..12] = u[15..12] × v[13] + v[12]


To make this example more concrete, the operation vfmadd233ps zmm1, zmm2, zmm3 results in the destination
zmm1 values shown in Table 7-3.


Table 7-3. The Scale-and-Bias Instruction vfmadd233ps on 32-bit Data


zmm1[0]  = zmm2[0]  × zmm3[1]  + zmm3[0]      zmm1[1]  = zmm2[1]  × zmm3[1]  + zmm3[0]
zmm1[2]  = zmm2[2]  × zmm3[1]  + zmm3[0]      zmm1[3]  = zmm2[3]  × zmm3[1]  + zmm3[0]
zmm1[4]  = zmm2[4]  × zmm3[5]  + zmm3[4]      zmm1[5]  = zmm2[5]  × zmm3[5]  + zmm3[4]
zmm1[6]  = zmm2[6]  × zmm3[5]  + zmm3[4]      zmm1[7]  = zmm2[7]  × zmm3[5]  + zmm3[4]
zmm1[8]  = zmm2[8]  × zmm3[9]  + zmm3[8]      zmm1[9]  = zmm2[9]  × zmm3[9]  + zmm3[8]
zmm1[10] = zmm2[10] × zmm3[9]  + zmm3[8]      zmm1[11] = zmm2[11] × zmm3[9]  + zmm3[8]
zmm1[12] = zmm2[12] × zmm3[13] + zmm3[12]     zmm1[13] = zmm2[13] × zmm3[13] + zmm3[12]
zmm1[14] = zmm2[14] × zmm3[13] + zmm3[12]     zmm1[15] = zmm2[15] × zmm3[13] + zmm3[12]


3 Unit in the Last Place (ulp), a measure of the accuracy of the least significant bit in a result.
4 For simplicity, the third operand is shown as vector register zmm3, although it could alternately be a memory reference.
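

The scale-and-bias pattern of Table 7-3 can be reproduced with a short Python sketch. Within every group of four
elements, the scale comes from the second element of the third operand's group and the bias from the first. The Python
function below is a hypothetical model that uses an ordinary multiply followed by an add, so it illustrates the indexing
only and not the fused, single-rounding arithmetic.

```python
def vfmadd233ps(zmm2, zmm3):
    """Model of the index pattern: out[i] = zmm2[i] * scale + bias per group of four."""
    out = []
    for i in range(16):
        base = 4 * (i // 4)            # first element of this 128-bit group
        scale = zmm3[base + 1]
        bias = zmm3[base]
        out.append(zmm2[i] * scale + bias)
    return out

zmm2 = [float(i) for i in range(16)]
# Per group: [bias, scale, unused, unused]
zmm3 = [10.0, 2.0, 0.0, 0.0,
        20.0, 3.0, 0.0, 0.0,
        30.0, 4.0, 0.0, 0.0,
        40.0, 5.0, 0.0, 0.0]
result = vfmadd233ps(zmm2, zmm3)
```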


The Intel® Xeon Phi™ coprocessor also introduces vector versions of the carry-propagate instructions (CPI). As with
scalar Intel Architecture carry-propagate instructions, these can be combined together to support wider integer
arithmetic than the hardware default. These are also building blocks for other forms of wide-arithmetic emulations. The
challenge incurred in the vector version of these instructions (discussed in detail later on) is that a carry-out flag must be
generated for each element in the vector. Similarly, on the propagation side, a carry-in flag must be added for each
element in the vector. The Intel® Xeon Phi™ coprocessor uses the vector mask register for both of these: as a carry-out
bit vector and as a carry-in bit vector.
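

The role of the mask register as a per-element carry vector can be sketched in Python by building a 64-bit vector add
from two 32-bit vector adds. The function names are hypothetical and are not instruction mnemonics; the sketch models
only the carry-out and carry-in bookkeeping.

```python
MASK32 = 0xFFFFFFFF

def add_set_carry(a, b):
    """Element-wise 32-bit add; bit i of the returned mask is the carry out of element i."""
    sums = [x + y for x, y in zip(a, b)]
    carry = sum(((s >> 32) & 1) << i for i, s in enumerate(sums))
    return [s & MASK32 for s in sums], carry

def add_with_carry(a, b, carry_in):
    """Element-wise 32-bit add that also adds bit i of carry_in to element i."""
    sums = [x + y + ((carry_in >> i) & 1) for i, (x, y) in enumerate(zip(a, b))]
    carry = sum(((s >> 32) & 1) << i for i, s in enumerate(sums))
    return [s & MASK32 for s in sums], carry

def add64(a_lo, a_hi, b_lo, b_hi):
    """Add vectors of 64-bit integers held as separate low and high 32-bit halves."""
    lo, k = add_set_carry(a_lo, b_lo)       # carry-out bit vector
    hi, _ = add_with_carry(a_hi, b_hi, k)   # same mask reused as carry-in
    return lo, hi
```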


There are many other additions to the Intel® Xeon Phi™ coprocessor instruction set, for use in both scalar and vector
operations. Some are discussed later in this guide, and all may be found in the Intel® Xeon Phi™ Coprocessor Instruction Set
Reference Manual (Reference Number: 327364).


7.9 Vector Nomenclature 


The microarchitecture design that implements the Intel® Xeon Phi™ coprocessor instruction set has certain artifacts that
can affect how operations and source modifiers work. The remainder of this section explores these concepts in depth,
focusing on the principles of supporting 32-bit data elements in vectors and pointing out microarchitecture limitations
and their implications along the way.


Recall that each vector register is 64 bytes wide, or 512 bits. It is useful to consider the vector register as being
subdivided into four lanes, numbered 3 ... 0, where each lane is 128 bits wide. This is shown graphically in Figure 7-3. Lanes
are always referred to by their lane number.


Lane 3 Lane 2 Lane 1 Lane 0 


Figure 7-3. Vector Register Lanes 3...0 


While this bears a superficial resemblance to the prior Intel® SSE Architecture, which uses 128-bit vector registers, the
actual implementation is quite different and the comparison is strongly discouraged. Viewing the Intel® Xeon Phi™
coprocessor from the perspective of Streaming SIMD Extensions will limit your comprehension of the capabilities that
the Intel® Xeon Phi™ coprocessor provides.


There are four 32-bit elements in a 128-bit lane, identified by the letters D...A. Regardless of which lane, elements
within a lane are always referred to by their element letter (as shown in Figure 7-4).


Lane 3: D C B A | Lane 2: D C B A | Lane 1: D C B A | Lane 0: D C B A


Figure 7-4. Vector Elements D...A Within a Lane 


In contrast, when discussing all 16 elements in a vector, the elements are denoted by a letter from the sequence P...A, as
shown in Figure 7-5.


P | O | N | M | L | K | J | I | H | G | F | E | D | C | B | A


Figure 7-5. Vector Elements P...A Across the Entire Vector Register


The terms lane and element, as well as their enumerations, are standard usage in this guide.
In the memory storage form for vectors in the Intel® Xeon Phi™ coprocessor, the lowest address is always on the
right-most side, and the terms are read right-to-left. Therefore, when loading a full vector from memory at location 0xA000,
the 32-bit vector elements P...A are laid out accordingly:


Address    Element

0xA000     a
0xA004     b
...        ...
0xA03C     p

7.10 Write Masking 


The vector mask registers in the Intel® Xeon Phi™ coprocessor are not general-purpose registers in the sense that they can
be used for arbitrary operations. They support a small set of native operations (such as AND, OR, XOR) while disallowing
many others (such as + or ×). This prevents the use of the vector mask registers as a form of 16-bit general-purpose
register or as a scratch pad for temporary calculations. Although the Intel® Xeon Phi™ coprocessor allows vector mask
registers to be loaded from and stored to general-purpose registers, general-purpose mathematical operations do not
exist for the vector mask registers.


Figure 7-6. Basic Vector Operation Without Write Masking Specified


The vector mask registers are intended to control the update of vector registers inside a calculation. In a typical vector
operation, such as a vector add, the destination vector is completely overwritten with the results of the operation. This
is depicted in Figure 7-6.


The two source vectors (shown in yellow) combine to overwrite the destination vector (shown in blue). The write-mask
makes the destination overwrite for each element conditional⁵, depending on the bit vector contents of a vector mask
register. This is depicted in Figure 7-7.


Figure 7-7. Basic Vector Operation With a Write Masking Specified


5 Alternately, think of making the per-element destination update predicated by the write mask.


Figure 7-8. Effect of Write Mask Values on Result


Here, the write mask register (shown in green) has been added. Now each element in the destination vector is 
conditionally updated with the results of the vector operation, contingent on the corresponding element position bit in 
the mask register. Figure 7-8 shows how the write-mask values affect the results. 


It is important to note that, unlike other Intel® Advanced Vector Extensions⁶, the write-mask behavior in the Intel® Xeon
Phi™ coprocessor is non-destructive to the destination register. As shown in Figure 7-8, where
the write-mask register has the value zero, the destination register is not modified for that element. Prior Intel
Architecture-based vector extensions were destructive; that is, when a mask value was set to zero, the destination
element was cleared (all bits reset to zero).


Write mask support is present on every Intel® Xeon Phi™ coprocessor vector instruction; however, the specification is
entirely optional. The reason is that for instructions which do not specify a mask register, a default value of 0xFFFF is
implied and used.


Of special note is the mask register k0. The instruction encoding pattern that corresponds to mask register k0 is used to
represent the default behavior when write masks are not specified (implying a default mask of 0xFFFF). So specifying
the "real" k0 as a write-mask register (which may not be equal to 0xFFFF) is not allowed. Though this mask register
cannot be used as a write mask register for Intel® Xeon Phi™ coprocessor vector instructions, it can be used for any of
the other mask register purposes (i.e., carry propagate, comparison results, etc.). A simple rule of thumb to determine
where k0 cannot be used is to examine the Intel® Xeon Phi™ coprocessor instruction set closely: any mask register
specified inside of the curly braces { } cannot be designated as k0, while any other use can be k0.


Write masking can introduce false dependencies from a hardware scheduler perspective. Consider two different mask
registers, k1 set to the value 0xFF00 and k2 set to the value 0x00FF. If two vector operations are issued back-to-back
with these mask registers, such as:


vaddps zmm1 {k1}, zmm2, zmm3
vaddps zmm4 {k2}, zmm5, zmm1


Then, for these instructions and assumed mask values, it is apparent that no true dependency exists in the register set.
The dependency through zmm1 is false because the half of the vector register that is not overwritten by the first
instruction is used as an input to the second instruction. The Intel® Xeon Phi™ coprocessor's hardware scheduler cannot
examine the masks early enough in the pipeline to recognize this false dependency. From the scheduler's perspective, all
write masks are considered to have the value 0xFFFF⁷.
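

The reasoning can be made concrete with a few lines of Python that compare the elements of zmm1 written by the first
instruction with those read by the second. This is an illustrative model only; the helper name elements and the variable
names are assumptions.

```python
def elements(mask):
    """Set of element indices enabled by a 16-bit mask."""
    return {i for i in range(16) if (mask >> i) & 1}

k1, k2 = 0xFF00, 0x00FF

written_by_first = elements(k1)    # zmm1 elements updated by the first vaddps
read_by_second = elements(k2)      # zmm1 elements consumed by the second vaddps

# With the real mask values, the two instructions touch disjoint halves of zmm1.
true_overlap = written_by_first & read_by_second

# The scheduler treats every write mask as 0xFFFF, so it sees an overlap.
assumed_overlap = elements(0xFFFF) & read_by_second
```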


6 Such as Intel® Advanced Vector Extensions (Intel® AVX).
7 The write mask is also not consulted for memory alignment issues.


Consider applying the simple Newton-Raphson root-finding approximation algorithm to a problem f(x) as an example of
how write masking is used in general:


$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$


When this is implemented, the calculation kernel is an iterative algorithm that runs in a loop until the error term falls
below a preselected target threshold ε. For whatever function is being evaluated, each successive iteration tests the
new value of x to determine whether its error term is less than ε.


Translating this behavior into use of the write mask is straightforward. Assuming 32-bit data values, the kernel would
operate on 16 different versions of the algorithm concurrently. That is, if you are trying to compute f(x) for the values a,
b, ..., p, then you start with the write mask set to a default value of all ones. At the end of each loop body, a vector
comparison is done to test whether the error terms of the computed values a, b, ..., p individually exceed ε. This
vector comparison, which is the reverse of the original convergence test (< ε), rewrites
the mask register to have the value one only for those elements that have not converged to meet the desired threshold
ε. An Intel® Xeon Phi™ coprocessor code sequence implementing a naive version of the algorithm for approximating the
square root of a number is shown in Figure 7-9. The modifier {1to16} used on the instructions that read [two], [neg1],
and [epsilon] is explained in Section 7.11.3.1.


; reset mask k1 to 0xFFFF
kxnor k1, k1

; load next 16 values
vmovaps zmm0, [data]

; load initial guess
vbroadcastss zmm1, [guess]

loop:
; Generate a new guess for the square root

; compute (guess^2)
vmulps zmm2 {k1}, zmm1, zmm1
; compute (2.0 * guess)
vmulps zmm3 {k1}, zmm1, [two] {1to16}
; compute (guess^2 - value)
vsubps zmm4 {k1}, zmm2, zmm0
; compute (zmm4/zmm3)
vdivps zmm5 {k1}, zmm4, zmm3
; new guess = (old guess - zmm5)
vsubps zmm1 {k1}, zmm1, zmm5

; find the amount of error in this new guess

; square of the new guess
vmulps zmm6 {k1}, zmm1, zmm1
; (guess^2 - value)
vsubps zmm7 {k1}, zmm6, zmm0
; -1.0f * (guess^2 - value)
vmulps zmm8 {k1}, zmm7, [neg1] {1to16}
; error = abs| guess^2 - value |
vgmaxabsps zmm9 {k1}, zmm7, zmm8

; check against epsilon, and loop if necessary

; k1 = error > epsilon
vcmpps k1 {k1}, zmm9, [epsilon] {1to16}, gt
; any elements to guess again?
kortest k1, k1
; if so, repeat
jnz loop


Figure 7-9. A Naive (Non-Optimal) Newton-Raphson Approximation Using a Write-Mask to Determine a Square Root


The variables data and guess are symbolic, meaning that they would normally be an address calculation using 
architectural registers and displacements. The instruction vdivps is not a real instruction, but instead a library call using a 
template sequence as described later. 
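

The control flow of Figure 7-9 can be followed more easily in a scalar Python model. The sketch below keeps a bit mask
with one bit per element, updates only the elements whose bit is set, and clears a bit once that element's error is no
longer greater than epsilon. The function name newton_sqrt, the default guess, the value of epsilon, and the iteration cap
are assumptions made for illustration; the model uses Python floats rather than 32-bit data.

```python
def newton_sqrt(values, guess=1.0, epsilon=1e-6, max_iters=100):
    """Masked Newton-Raphson square root over a list of positive values."""
    n = len(values)
    x = [guess] * n                      # broadcast the initial guess
    k1 = (1 << n) - 1                    # all elements active (mask of all ones)
    iterations = 0
    while k1 != 0 and iterations < max_iters:    # loop while any mask bit is set
        next_mask = 0
        for i in range(n):
            if (k1 >> i) & 1:            # masked-off elements are left untouched
                # new guess = guess - (guess^2 - value) / (2 * guess)
                x[i] = x[i] - (x[i] * x[i] - values[i]) / (2.0 * x[i])
                if abs(x[i] * x[i] - values[i]) > epsilon:
                    next_mask |= 1 << i  # not converged yet: keep iterating
        k1 = next_mask
        iterations += 1
    return x, iterations

values = [float(i + 1) for i in range(16)]
roots, iterations = newton_sqrt(values)
```

Elements that converge early simply drop out of the mask, which is the software analogue of the write-mask keeping
their destination values unchanged in later iterations.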


The vector mask test instruction kortest (the next-to-last instruction in Figure 7-9) reports whether any bits in the
bit-wise OR of the operand mask registers are set to the value one, updating the ZF and CF bits in the EFLAGS register
based on the outcome. This instruction is typically followed by a conditional branch; here, the branch will re-execute the
loop to compute the next estimate of x for those elements that have not yet converged. This use of the write mask may
also serve as a mechanism for reducing total power consumption, as those elements which are masked off are (in general)
not computed. In the case of the Newton-Raphson approximation, if half the vector elements have converged, then only
half the vector ALUs will be used in the next iteration of the kernel. The remainder will be idle and will thereby consume
less power. Other uses for the vector mask registers will be discussed later.


7.11 Swizzling


Swizzles modify a source operand for the duration of a single instruction. Swizzles do not permanently alter the state of 
the source operand they modify. The proper conceptual model is that a swizzle makes a temporary copy of one input to 
an instruction, performs a data pattern combination on the temporary value, then feeds this altered temporary value to 
the ALU as a source term for one instruction. After the instruction is completed, the temporary value generated by the 
swizzle is discarded. 


In effect, swizzles are a way to create data pattern combinations without requiring extra registers or instructions — as 
long as the desired combination is possible with the swizzle support at the microarchitectural level. Swizzles, like write 
masks, are optional arguments to instructions: 


vop zmm1 [{k1}], zmm2, zmm3/ptr [{swizzle}]


While swizzles are powerful and useful, there are restrictions on the types of swizzles that may be used for a given 
instruction. These restrictions hinge upon the actual form of the instruction, but they also carry microarchitectural 
implications with them. In order to understand why swizzles work the way they do, it is useful to understand how the 
hardware support for swizzles works. A conceptual diagram of the swizzle logic is shown in Figure 7-10. 


[Figure: the 512-bit second source (Src 2) is split into four 128-bit lanes (Lane 3 to Lane 0). The lanes feed the lane
muxes to form Tmp 1; Tmp 1 feeds the element muxes to form Tmp 2, which goes to the ALU.]


Figure 7-10. Partial Microarchitecture Design for Swizzle Support 


The second source operand (shown in green) for a given instruction is the basis for all swizzle modifiers. This source is 
conceptually broken into the same 128-bit lanes as previously described. Each of the four lanes from the second source 
is presented to a multiplexor (mux) that is 128 bits wide. 


Recall that the swizzle changes a source for the duration of the instruction by making a temporary copy (shown in
yellow) of the source. The lane mux is capable of choosing which source lane is fed into which temporary copy lane. The
lane multiplexors (muxes) allow for the remapping of any valid combination of source lanes into the temporary value
lanes, such as taking the source lane pattern {3210} and creating a new lane pattern {3302}.


The first temporary value is then fed through a second set of data muxes to generate a second temporary value (shown
in blue). The second set of muxes operates on the data elements within a lane from the first temporary value; these
muxes are therefore called the element muxes. Element muxes are 32 bits wide and allow data pattern combinations within
the lane. Taking a lane's source pattern in the first temporary value of the form {dcba}, the element muxes can be used
to make a new combination such as {adbb}. The full implementation is shown in Figure 7-11 and depicts all the muxes in
the swizzle logic.
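
The two mux stages can be sketched as a small software model. The Python below is a conceptual illustration only: the
function names and the list layout (index 0 holds element a of lane 0) are assumptions made for the example, and the
patterns are written left to right from the highest lane or element to the lowest, as in the text.

```python
LANES = 4
ELEMS_PER_LANE = 4


def lane_mux(src, lane_pattern):
    """Stage 1: build Tmp 1 by choosing a source lane for every destination lane."""
    sel = [int(ch) for ch in reversed(lane_pattern)]   # sel[i] = source lane for lane i
    tmp1 = []
    for dst_lane in range(LANES):
        start = sel[dst_lane] * ELEMS_PER_LANE
        tmp1.extend(src[start:start + ELEMS_PER_LANE])
    return tmp1


def element_mux(tmp1, elem_pattern):
    """Stage 2: build Tmp 2 by reordering elements inside each lane ('a' = element 0)."""
    sel = [ord(ch) - ord("a") for ch in reversed(elem_pattern)]
    return [tmp1[lane * ELEMS_PER_LANE + s] for lane in range(LANES) for s in sel]


src = list(range(16))                 # element i holds the value i
tmp1 = lane_mux(src, "3302")          # lane pattern {3302}
tmp2 = element_mux(tmp1, "adbb")      # element pattern {adbb}
print(tmp1)
print(tmp2)
```

With these inputs, lane 0 of Tmp 1 holds the contents of source lane 2, and the element stage then rearranges each lane
independently, which is why no element can move between lanes in the second stage.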


[Figure: the 512-bit source (Src 2), shown as four 128-bit lanes (Lane 3 to Lane 0), passes through the lane muxes into
Tmp 1 and then through the per-lane element muxes into Tmp 2 (elements P through A), which goes to the ALU.]


Figure 7-11. The Complete Microarchitecture Design for Swizzle Support 


The second temporary value computed reflects the results of both the lane muxes and the element muxes acting on the
source value. This result is, in turn, fed to the ALU for the requested instruction. There are several important
implications to be drawn from the mux network design that impact how software should be written and that limit the
use of swizzles.


7.11.1 Swizzle Limitations 


While all of the hardware capabilities of the swizzle logic previously described are fully supported by the shuffle
instruction and the lane muxes are driven by some memory-form swizzles, general swizzles do not allow you to specify a
lane mux pattern. In those cases, the default behavior is that the source lane is fed directly to the same lane in the first
temporary value copy. The implication of this rule is that the lane boundaries are hard boundaries. The hardware cannot
combine elements from different source lanes into the same lane at the conclusion of the swizzle. The only way to cross
a lane boundary is through load or store operations, or through the vector shuffle instruction.


Second, to drive the element muxes completely would require 32 bits of immediate encoding on every instruction. This
is clearly impractical, so the Intel® Xeon Phi™ coprocessor presents a choice of eight predefined swizzle patterns. This
allows all swizzles to be represented in a three-bit field in every instruction's encoding. Whichever pattern is picked in a
swizzle operation, that pattern is applied across all lanes.
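
The bit counts quoted above follow from simple mux arithmetic, sketched below in Python. The helper name select_bits
is hypothetical; the only inputs are the figures given in the text (16 elements, four elements per lane, eight
predefined patterns).

```python
def select_bits(choices):
    """Bits needed to pick one of `choices` inputs on a mux."""
    return (choices - 1).bit_length()


ELEMENTS = 16
ELEMENTS_PER_LANE = 4

# One element mux per destination element, each choosing among the 4 elements of its lane.
full_element_control = ELEMENTS * select_bits(ELEMENTS_PER_LANE)
# Selecting one of the eight predefined patterns instead.
predefined_field = select_bits(8)

print(full_element_control, predefined_field)
```

Sixteen two-bit selectors account for the 32 bits of immediate encoding, while choosing among eight fixed patterns
needs only the three-bit field.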


When all operands to an instruction come from the vector register file, the choice of predefined swizzles to be applied
is made from the Register-Register form, indicating that they operate on a source register. When the second source
argument comes from memory instead of from the vector register file, the available swizzles are from the
Register-Memory form. Each type of swizzle behavior is explored below.


7.11.2 Register-Register Swizzle Form 


The register-register form of the swizzle is only available when all operands to an instruction have vector register 
sources, conforming to the syntax: 


vop zmm1 [{k1}], zmm2, zmm3 [{swizzle}]


To illustrate the interpretation of the register-register swizzle form, we will walk through an example vector instruction:
vorpd zmm0 {k1}, zmm1, zmm2 {aaaa}. This instruction is a bit-wise OR operation, using signed 32-bit
integers. The destination zmm0 will be modified on a per-element basis, with that control specified by the write mask
register k1 values. The two sources to the OR operator are the vector registers zmm1 and zmm2. Figure 7-12 shows the
initial evaluation of this instruction and the swizzle logic.


[Figure: vorpi v0 {k1}, v1, v2 {aaaa}. Src2 (v2), Lane 3 to Lane 0, passes unchanged into Tmp 1.]


Figure 7-12. Register-Register Swizzle Operation: Source Selection 


As depicted, the second source vector zmm2 becomes the input to the swizzle logic. Since swizzles cannot drive the lane 
muxes, the entire source is pushed straight through the first temporary value stage in preparation for driving the 
element muxes. The swizzle field, denoted by {aaaa} will directly drive the element muxes. Figure 7-13 shows the 
element muxes that will be configured. 


[Figure: vorpi v0 {k1}, v1, v2 {aaaa}. Tmp 1 (v2) feeds the element muxes, which produce Tmp 2.]


Figure 7-13. Register-Register Swizzle Operation: Element Muxes 


As with the vector element enumeration, the swizzle is read right-to-left in order to drive the element muxes. The
rightmost designator of the swizzle (a in this case) indicates that the first element in the final destination should be
set equal to the designated swizzle selection (a) element source of the lane. The second designator from the right (a
again) indicates that the second element in the lane should be set to the designated swizzle element (a). This is
repeated across the entire lane, as shown in Figure 7-14.


[Figure: vorpi v0 {k1}, v1, v2 {aaaa}. The element muxes map the first lane of Tmp 1 (v2) to A|A|A|A in Tmp 2.]


Figure 7-14. Register-Register Swizzle Operation: First Lane Completion 


Remember, whatever swizzle pattern the instruction specifies drives the element muxes across all lanes in the same 
pattern. So, each lane driven has the same result, although the actual per-lane data values differ. Figure 7-15 shows the 
final result of the swizzle applied to the second source argument, with the final value driven to the ALU for the full 
instruction to operate on — in this example, the vorpd instruction. 


[Figure: vorpi v0 {k1}, v1, v2 {aaaa}. The element muxes replicate element a of each lane of Tmp 1 (v2) across that lane
in Tmp 2, which is sent to the ALU.]


Figure 7-15. Register-Register Swizzle Operation: Complete Swizzle Result 


There are eight predefined swizzle patterns that operate in register-register mode: aaaa, bbbb, cccc, dddd, dacb,
badc, cdab, and dcba. The first four patterns correspond to repeat element selections, and are useful in many
operations such as scalar-vector arithmetic. The next three patterns are common interleaving patterns, corresponding
loosely to interleave pairs, swap pairs, and rotate pairs. These three are applicable to a wide range of arithmetic
manipulations, including cross products (matrix determinant) and horizontal operations (such as horizontal add). The
last pattern (dcba) corresponds to a no-change pattern, where the data is not reordered. The no-change pattern is the
default pattern if no swizzle argument is specified in an instruction.
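
The effect of the eight patterns can be checked with a short software model. This Python sketch is illustrative only:
the swizzle function and the list layout (index 0 is element a of lane 0) are assumptions for the example, and the
pattern string is read right-to-left as described above.

```python
SWIZZLES = ("aaaa", "bbbb", "cccc", "dddd", "dacb", "badc", "cdab", "dcba")


def swizzle(vec, pattern="dcba"):
    """Apply one predefined register-register swizzle to every lane of a 16-element vector."""
    if pattern not in SWIZZLES:
        raise ValueError("not a predefined register-register swizzle: " + pattern)
    # The rightmost letter selects the source for element 0 of each lane.
    sel = [ord(ch) - ord("a") for ch in reversed(pattern)]
    return [vec[lane * 4 + s] for lane in range(4) for s in sel]


vec = list("abcdefghijklmnop")       # lane 0 = a..d, lane 1 = e..h, and so on
for pattern in SWIZZLES:
    lane0 = swizzle(vec, pattern)[:4]
    # Print lane 0 in the document's right-to-left notation.
    print(pattern, "->", "".join(reversed(lane0)))
```

Printed right-to-left, lane 0 of each result spells the pattern name itself, and dcba returns the vector unchanged.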


7.11.3 Register-Memory Swizzle Form


The register-memory form of the swizzle is available for all implicit load operations, conforming to the syntax: 


vop zmm1 [{k1}], zmm2, ptr [{swizzle}]


In register-register swizzles, you specify data pattern combinations that are applied to a temporary copy of the second 
source argument. In register-memory swizzles, you specify either a data broadcast operation (data replication), or else a 
data conversion. 


With the implicit in-line load operation form, it is not possible to specify both a conversion and a broadcast on the same 
instruction. In order to have both broadcast and conversion, as well as a larger range of conversion choices, a true vector 
load operation must be used. 


7.11.3.1 Data Broadcast 


The purpose of the data broadcast swizzle is to perform data replication. Instead of loading a full 64-byte vector width 
of data from the cache hierarchy, a subset of that data is loaded. This subset will then replicate itself an integral number 
of times until the full 64-byte width is achieved. This is useful for optimizing memory utilization into the cache hierarchy. 


There are three predefined swizzle patterns that operate in register-memory mode when performing data broadcast
operations: 1to16, 4to16, and 16to16. The last pattern (16to16) corresponds to a load-all pattern, where the data
is not broadcast. This is the default pattern if no swizzle argument is specified for an instruction that takes this form.


The interpretation of the swizzle designator is to read the two numbers involved directly. The first is the number of
32-bit elements to read from memory, while the second is the total number of 32-bit elements being populated. In the
case of 1to16, exactly one 32-bit value is being read from memory, and it is being replicated 15 times to make a total
of 16 values. This is depicted in Figure 7-16.


[Figure: vorpi v0 {k1}, v1, [rax] {1to16}. The single element A loaded from [rax] passes through the lane muxes (Tmp 1)
and the element muxes (Tmp 2) and reaches the ALU replicated across Lane 3 to Lane 0.]


Figure 7-16. The 1-to-16 Register-Memory Swizzle Operation


Note that, while you cannot choose lanes with the swizzles, the register-memory swizzle modifiers conceptually have
the ability to modify lane mappings. The modifier 1to16 does just that in replicating the data loaded from memory.
Thus lane zero is replicated to all lanes in the figure. Two common use cases for the 1to16 modifier are: performing a
scalar-vector multiply and adding a constant to each element in a vector.


The modifier 4to16 is depicted in Figure 7-17. All the lane muxes have been driven to point to the source lane zero, but
a full 128 bits have been loaded from memory corresponding to the elements d, c, b, a. The four loaded values from
memory are replicated across each lane.


[Figure: vorpi v0 {k1}, v1, [rax] {4to16}. The four elements loaded from [rax] are replicated into Lane 3, Lane 2,
Lane 1, and Lane 0.]


Figure 7-17. The 4-to-16 Register-Memory Swizzle Operation


The default (implied) 16to16 operator loads the full 16 elements from memory, driving the lane and element muxes in
the default pass-through configuration. The 16to16 operator is optional.
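
The three broadcast modifiers can be summarized with a small software model. The Python sketch below is illustrative:
broadcast_load is a hypothetical name, and memory is modeled as a plain list of 32-bit elements starting at the load
address.

```python
BROADCAST_COUNTS = {"1to16": 1, "4to16": 4, "16to16": 16}


def broadcast_load(memory, mode="16to16"):
    """Read 1, 4, or 16 elements and replicate them to fill 16 vector elements."""
    count = BROADCAST_COUNTS[mode]
    loaded = list(memory[:count])
    if len(loaded) < count:
        raise ValueError("not enough data at the load address")
    return loaded * (16 // count)


memory = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
print(broadcast_load(memory, "1to16"))    # one element, replicated to all 16 positions
print(broadcast_load(memory, "4to16"))    # four elements, repeated in every lane
print(broadcast_load(memory))             # default: all 16 elements, no replication
```

Only the first one or four elements are read in the broadcast cases, which is the source of the reduced memory traffic
mentioned above.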


7.11.3.2 Data Conversion 


The purpose of the data conversion modifier is to use the swizzle field to read data from memory that is in a convenient 
memory storage form, and to convert it to match the 32-bit wide native Intel® Xeon Phi™ coprocessor data type of 
vector elements. As stated previously, the Intel® Xeon Phi™ coprocessor performs all operations assuming a collection of 
16 elements, each 32 bits in size. There are four predefined swizzle patterns that operate in register-memory mode
when performing data-conversion operations. A different set of four conversions is available depending on whether the 
conversion is to a destination floating-point or to an integer operation. 


For floating-point instructions, conversions from memory are supported via inline swizzles: sint16 and uint8. For
integer (signed or unsigned) instructions, conversions from memory are supported via inline swizzles: uint8, sint8,
uint16, and sint16. If no swizzle conversion is specified, then the load from memory undergoes no conversion,
meaning that the data is assumed to be in the appropriate native format in the memory storage form (float32 or
signed/unsigned int32). Figure 7-18 depicts the conversion process for vorpi from a memory storage form of uint8
to the sint32 expected for the instruction.


Figure 7-18. The uint8 Data Conversion Register-Memory Swizzle Operation


For this particular vorpd instruction, the data in memory consists of a series of one-byte items. The conversion process
reads the uint8 data from memory, and converts it to a signed 32-bit integer. The 16 converted values are then fed to
the ALU for the OR operation to use. For this operation, the floating-point conversion types are disallowed because
vorpd requires signed integers; thus, float16 is an illegal conversion specifier in this example.
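
The integer up-conversions can be modeled in a few lines. This Python sketch is an illustration under stated
assumptions: load_convert is a hypothetical name, memory is modeled as a bytes object, and multi-byte items are assumed
to be stored little-endian.

```python
# (size in bytes, signed?) for each integer memory storage form named in the text
STORAGE_FORMS = {
    "uint8": (1, False),
    "sint8": (1, True),
    "uint16": (2, False),
    "sint16": (2, True),
}


def load_convert(raw, kind):
    """Read 16 narrow integers from `raw` and widen each to a full-width integer."""
    size, signed = STORAGE_FORMS[kind]
    if len(raw) < 16 * size:
        raise ValueError("need at least %d bytes" % (16 * size))
    return [
        int.from_bytes(raw[i * size:(i + 1) * size], "little", signed=signed)
        for i in range(16)
    ]


raw = bytes(range(240, 256))              # sixteen one-byte items in memory
print(load_convert(raw, "uint8"))         # zero-extended: 240 .. 255
print(load_convert(raw, "sint8"))         # sign-extended: -16 .. -1
```

The same 16 bytes yield different vector contents depending on the storage form named in the swizzle, which is why the
specifier must match how the data was written to memory.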


7.11.4 Swizzle Usage Examples 


In order to demonstrate scenarios where swizzles are useful, three examples are briefly explored: scalar-vector 
multiplication, horizontal addition, and 3x3 matrix determinants. 


7.11.4.1 Scalar-Vector Multiplication 


There are two readily usable forms for scalar-vector multiplication available. The conceptual difference between the two 
forms hinges on whether you are using lane semantics (where each vector is broken into four lanes) or total vector 
semantics (where each vector is considered a collection of 16 elements without additional abstraction levels). 


For the lane semantic case, scalar-vector products expect each lane to have a different scalar multiplier applied to the
destination vector. For example, consider the following vectors:

$\vec{u} = \langle x, x, x, S_3,\; x, x, x, S_2,\; x, x, x, S_1,\; x, x, x, S_0 \rangle$ and $\vec{v} = \langle p, o, \ldots, a \rangle$


where the value x represents a value that is irrelevant to the operation. A lane-based scalar-vector product would result
in the vector:


$\vec{z} = \langle S_3 \langle p, o, n, m \rangle,\; S_2 \langle l, k, j, i \rangle,\; S_1 \langle h, g, f, e \rangle,\; S_0 \langle d, c, b, a \rangle \rangle$

$\vec{z} = \langle S_3 p, S_3 o, S_3 n, S_3 m, S_2 l, S_2 k, S_2 j, S_2 i, \ldots, S_0 a \rangle$


In terms of Intel® Xeon Phi™ coprocessor programming, this operation is easily conveyed in one instruction, as long as
the scalar multipliers reside in a vector register and each scalar multiplier resides in the same per-lane element field
(e.g., each multiplier is in the position of element a for every lane). Let zmm1 = u and zmm2 = v. The scalar-vector
multiplication for lane semantics, assuming floating-point data with a destination of zmm0 and a write mask of k1, is
simply:


vmulps zmm0 {k1}, zmm2, zmm1 {aaaa}


This single instruction will obtain the desired result; however, using it frequently may not be optimal. Use of the
register-register swizzle modifier incurs a +1 clock penalty on the latency (see footnote 8) of an instruction using the
swizzle. Therefore, if the same data value (here, zmm1) is frequently reused with the same swizzle, it might be better
to generate a temporary register with the desired data pattern combination and then refer to that temporary register.


For total vector semantics, the operation follows traditional mathematical rules. For some constant, k, the scalar-vector
multiplication against a vector $\vec{u} = \langle p, o, \ldots, a \rangle$ would result in
$\vec{z} = k \langle p, o, \ldots, a \rangle$. This, too, can be achieved with a single Intel® Xeon Phi™ coprocessor
instruction, assuming the constant k resides in memory:


vmulps zmm0 {k1}, zmm2, [k] {1to16}


In this case, the memory storage form must match the expected input to the instruction, which for vmulps is a 32-bit 
floating-point value. While this operation will get the desired result, it consumes cache bandwidth. In a situation where 
this constant is used repeatedly, it might be better to load the constant (with broadcast replication) into a temporary 
vector register, and then use the temporary register to preserve cache bandwidth. 
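
Both forms of scalar-vector multiplication can be compared in a short software model. The Python sketch below is
illustrative only: the helper names are hypothetical, index 0 is element a of lane 0, and the write mask is taken to be
all ones so that every element is written.

```python
def swizzle_aaaa(vec):
    """Replicate element a (index 0 of each lane) across its lane."""
    return [vec[(i // 4) * 4] for i in range(16)]


def broadcast_1to16(scalar):
    """Replicate a single value loaded from memory to all 16 elements."""
    return [scalar] * 16


def vmulps(x, y):
    """Element-wise multiply with an all-ones write mask."""
    return [p * q for p, q in zip(x, y)]


v = [float(i) for i in range(1, 17)]

# Lane semantics: one multiplier per lane, held in element a; x entries are don't-care.
u = [2.0, 0.0, 0.0, 0.0,  3.0, 0.0, 0.0, 0.0,  5.0, 0.0, 0.0, 0.0,  7.0, 0.0, 0.0, 0.0]
lane_result = vmulps(v, swizzle_aaaa(u))

# Total vector semantics: one constant k for the whole vector.
total_result = vmulps(v, broadcast_1to16(3.0))

print(lane_result)
print(total_result)
```

In the lane form each group of four elements is scaled by its own multiplier, whereas the broadcast form scales all 16
elements by the same constant.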


7.11.4.2 Horizontal Addition 


From the viewpoint of SIMD architecture, the Intel® Xeon Phi™ coprocessor implements wide vectors. Horizontal
operations (also known as reduction operations) do not scale well in microarchitecture implementations due to the
inherent serial nature of the operation. For any given vector $\vec{v} = \langle v_{n-1}, v_{n-2}, \ldots, v_0 \rangle$,
a horizontal operation on $\vec{v}$, such as addition, requires the computation of:

$\sum_{i=0}^{n-1} v_i$


This computed value is commonly stored in either the beginning or ending element of the resulting vector, or else in
every element position of the resulting vector. Some forms of horizontal operations are defined to leave partial results
in each element of the vector. That is, if the summation were redefined to be:

$f(n, \vec{v}) = \sum_{i=0}^{n-1} v_i$

then some horizontal operations would require computing the resultant vector. Where n is the number of elements in
the vector, the resultant vector becomes:


$\vec{z} = \langle f(n, \vec{v}), f(n-1, \vec{v}), \ldots, f(1, \vec{v}) \rangle$

8 This is an observed extension of the pipeline depth; it does not trigger a core stall or otherwise impede the thread,
except under standard dependency rules.


The Intel® Xeon Phi™ coprocessor does not provide direct support for horizontal operations, because they cannot be
scaled efficiently in microarchitectures as the SIMD vector width lengthens. As a general observation, most needs for
horizontal operations stem from inner-loop unrolling, where data parallelization is applied to the data set of the
innermost loop body. Horizontal or reduction operations are common in this type of parallelization. In contrast,
outer-loop unrolling achieves a different effect by parallelizing instances through each element in the vector. There is
rarely a need for horizontal operations in outer-loop unrolling. An extreme form of outer-loop unrolling would be the
typical programming model of High-Level Shader Language (HLSL), where operations are defined only on scalar data
types. In the HLSL case, parallel threads are run so that thread $t_i$ is executed in vector element $v_i$. In this case,
the number of threads being executed in parallel equals the number of vector elements supported by the underlying
microarchitecture.


Intel® Xeon Phi™ coprocessor programming uses outer-loop unrolling and a more aggressive form of thread
parallelism through the vector elements. From this perspective, the lack of horizontal operations does not impede highly
efficient programming constructs. When situations arise where horizontal operations are needed, they tend to happen
at the final stage. When combining work from multiple virtual threads in the vector elements, a few extra cycles in the
calculation to emulate horizontal operations have a negligible effect on overall program performance.


Keeping this in mind, it is still possible to perform horizontal operations efficiently even though the Intel® Xeon Phi™
coprocessor lacks single-instruction versions of common horizontal operations. For example, to compute the dot
product of two vectors, use a vector multiply followed by a horizontal add. To continue with the example of horizontal
add, let vector v = (d, c, b, a). The horizontal add applied to v will return a vector:


z = (d+c+b+a, d+c+b+a, d+c+b+a, d+c+b+a)


The sequence of Intel® Xeon Phi™ coprocessor instructions shown in Figure 7-19 can easily be constructed, assuming an
input vector zmm0 with the contents (x, x, ..., x, d, c, b, a), where x is a data value that is uninteresting.


; Goal: zmm1 = (..., d+c+b+a, d+c+b+a, d+c+b+a, d+c+b+a)

; operation                        value in Lane 0 of zmm1
vaddps zmm1, zmm0, zmm0 {cdab}   ; (d+c), (c+d), (b+a), (a+b)
vaddps zmm1, zmm1, zmm1 {badc}   ; (d+c)+(b+a), (c+d)+(a+b), (b+a)+(d+c), (a+b)+(c+d)


Figure 7-19. Trivial Implementation of a Horizontal Add Operation Within One Vector Lane 


At the end of the addition, with the alternating swizzles cdab and badc, it is clear that the horizontal addition has
completed within each lane of the vector zmm1. Using floating-point commutativity rules (see footnote 9), the actual
result of the destination vector in lane zero is:


< (d+c)+(b+a), (d+c)+(b+a), (d+c)+(b+a), (d+c)+(b+a) >


9 Commutativity is limited when x87 or up-conversion of a data type is involved.


If the horizontal operation uses floating-point data, be very careful when using the results, because all four values
violate the associativity rules set by IEEE industry standards. Unsafe compiler optimizations may produce these results,
or programmers who are aware of the implications of the violation may choose to use this technique regardless. Three
Intel® Xeon Phi™ coprocessor instructions are required to get a correct horizontal add result with respect to
floating-point associativity, but the programmer must know whether to evaluate the terms left-to-right or right-to-left
(e.g., (((a+b)+c)+d) versus (((d+c)+b)+a)).


The two instructions used in Figure 7-19 to determine the horizontal add only compute that horizontal add result within 
each lane of the vector concurrently. If the horizontal add only applies to a single lane, then the previously mentioned 
instructions generate four horizontal add results. 
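
The two-instruction sequence, and the associativity caveat that comes with it, can be reproduced in a software model.
The Python sketch below is an illustration only: the helper names are hypothetical, index 0 is element a of lane 0, and
Python floats are double precision rather than the 32-bit values the hardware operates on.

```python
def swizzle(vec, pattern):
    """Per-lane element swizzle; the pattern is read right-to-left ('a' = element 0)."""
    sel = [ord(ch) - ord("a") for ch in reversed(pattern)]
    return [vec[lane * 4 + s] for lane in range(4) for s in sel]


def vaddps(x, y):
    return [p + q for p, q in zip(x, y)]


def horizontal_add_per_lane(v):
    t = vaddps(v, swizzle(v, "cdab"))      # pair sums: (d+c), (c+d), (b+a), (a+b)
    return vaddps(t, swizzle(t, "badc"))   # each element: one pair sum plus the other


sums = horizontal_add_per_lane([float(i) for i in range(16)])
print(sums)                                # four copies of each lane's sum

# Pairwise evaluation (a+b)+(c+d) versus left-to-right ((a+b)+c)+d
lane = [1e16, 1.0, -1e16, 1.0] * 4         # a, b, c, d in every lane
pairwise = horizontal_add_per_lane(lane)[0]
sequential = ((lane[0] + lane[1]) + lane[2]) + lane[3]
print(pairwise, sequential)
```

With well-conditioned inputs both orders agree, but the second input shows the pairwise order and the left-to-right
order producing different sums from the same four values.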


7.11.4.3 3x3 Matrix Cross Product (Determinant) 


A common vector operation computes a 3x3 cross-product (matrix determinant) to find a normal vector for two input
vectors or the area of a parallelogram defined by two input vectors. For two vectors
$\vec{u} = \langle c, b, a \rangle$ and $\vec{v} = \langle o, n, m \rangle$, the cross product is
$\vec{u} \times \vec{v} = \langle an - bm,\; cm - ao,\; bo - cn \rangle$. While the cross product is only defined for
either three- or seven-dimensional vectors, the calculation for the cross product is equal to calculating the
determinant of a matrix.


Using the swizzle modifiers carefully makes this computation easy and can attain four products for every three
instructions. A simple code sequence that calculates the determinant in a lane is shown in Figure 7-20. It is actually
possible to accumulate multiple determinants by adding the incremental results in the vector zmm2 (not shown). The
third and final instruction corrects the order of the terms in zmm3, but it follows that, if zmm2 is non-zero, then the
result will be added to any prior data.


; zmm0 = u
; zmm1 = v
; zmm2 = <zero vector>
; Goal: zmm3 = (..., an-bm, cm-ao, bo-cn)

; operation                              value in Lane 0 of zmm3
vmulps       zmm3, zmm0, zmm1 {dacb}   ; (cm, bo, an)
vfnmadd231ps zmm3, zmm1, zmm0 {dacb}   ; (cm-ao, bo-cn, an-bm)
vaddps       zmm3, zmm2, zmm3 {dacb}   ; (an-bm, cm-ao, bo-cn)


Figure 7-20. Trivial Implementation of a 3x3 Matrix Cross-Product Within One Vector Lane 


As with the horizontal add, the three-instruction sequence shown computes a 3x3 matrix determinant in each lane of
the vector. Therefore, four determinants are generated for every three instructions. This code sample can also be
extended to higher-order square matrix determinant calculations. Higher-order determinants are calculated by iterative
cofactor expansion through row or column reductions.
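
The three-step sequence can be traced in a software model to confirm the term ordering. The Python sketch below is
illustrative only: the helper names are hypothetical, index 0 of each lane is element a (the x component), element d is
unused, and the last step is modeled with a dacb swizzle on its final operand, which is the reordering needed to reach
the stated goal.

```python
def swizzle(vec, pattern):
    """Per-lane element swizzle; the pattern is read right-to-left ('a' = element 0)."""
    sel = [ord(ch) - ord("a") for ch in reversed(pattern)]
    return [vec[lane * 4 + s] for lane in range(4) for s in sel]


def cross_per_lane(u, v, acc=None):
    zmm2 = list(acc) if acc is not None else [0.0] * 16
    # Step 1: multiply u by swizzled v            -> (cm, bo, an)
    zmm3 = [x * y for x, y in zip(u, swizzle(v, "dacb"))]
    # Step 2: subtract v times swizzled u         -> (cm-ao, bo-cn, an-bm)
    zmm3 = [t - x * y for t, x, y in zip(zmm3, v, swizzle(u, "dacb"))]
    # Step 3: add to accumulator, reordering terms -> (an-bm, cm-ao, bo-cn)
    return [p + q for p, q in zip(zmm2, swizzle(zmm3, "dacb"))]


# Each lane holds one 3-vector as [x, y, z, unused].
u = [1.0, 2.0, 3.0, 0.0] * 4
v = [4.0, 5.0, 6.0, 0.0] * 4
print(cross_per_lane(u, v)[:4])
```

Because the same swizzle is applied in every lane, loading four different vector pairs into the four lanes yields four
independent cross products from the same three steps.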


7.12 The Shuffle 


Fundamentally, the power and ease of using swizzles on almost any instruction in the Intel® Xeon Phi™ coprocessor
architecture makes manipulating data pattern combinations remarkably easy. There remain cases, however, where a
more powerful data-pattern combination function is needed. These cases generally hinge on the need to cross lane
boundaries, a feature that swizzles cannot provide. One option to cross a lane boundary is to use load-store operations
to memory, but this has the undesirable side effect of increasing cache pressure and wasting the level one caches, which
are a limited resource. Using load-store operations also interferes with other scheduled activities, such as prefetching,
gather-scatter operations, and so on.


For generalized data pattern combinations and to cross lane boundaries, the proper instruction to use is the shuffle. The
Intel® Xeon Phi™ coprocessor's shuffle instructions are a more advanced form of those found in Intel's SSE or AVX
instruction sets. In the Intel® Xeon Phi™ coprocessor, the shuffle instructions are not fully generalized; that is, there are
limitations to how you can combine the data pattern in one instruction.


Unlike the swizzle, the shuffle instruction allows you to specify the controls for the lane muxes. Any valid combination of
lanes may be routed from the source to the first temporary value. The first restriction of the vector shuffle in the Intel®
Xeon Phi™ coprocessor is that you can rearrange lanes, but you cannot combine elements from different lanes in one
instruction. Eight bits of immediate encoding are required in the shuffle instruction in order to fully specify the lane mux
selections.


Once the lane sources are mapped, it becomes possible to drive the element muxes. To support fully arbitrary control
sequences across all of the element muxes, however, would require 32 bits of immediate encoding on the shuffle
instruction. Support for this is impractical in a pipeline implementation. To reduce the encoding space required, the
vector shuffle instruction of the Intel® Xeon Phi™ coprocessor allows you to specify an arbitrary element combination
inside of a single lane. Once this combination is selected, the same pattern is applied across all the lanes. This is similar
to how the swizzle works, in that the pattern is applied across all lanes; but unlike the swizzle, any arbitrary combination
may be chosen for the pattern. The shuffle instruction does not restrict you to eight possible combinations.


The Intel® Xeon Phi™ coprocessor supports the following shuffle instructions:


▪ VPSHUFD
  vpshufd zmm1 {k1}, zmm2/mt, imm8
  Shuffles 32-bit blocks of the vector read from memory or vector zmm2/mem using index bits in the immediate. The
  result of the shuffle is written into vector zmm1. No swizzle, broadcast, or conversion is performed by this
  instruction.

▪ VPERMF32X4
  vpermf32x4 zmm1 {k1}, zmm2/mt, imm8
  Shuffles 128-bit blocks of the vector read from memory or vector zmm2/mem using index bits in the immediate.
  The result of the shuffle is written into vector zmm1. No swizzle, broadcast, or conversion is performed by this
  instruction.
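
The way an 8-bit immediate can drive either set of muxes is sketched below in Python. This is a conceptual model, not a
definition of the instructions: the function names are hypothetical, and the decoding (two index bits per destination
element or lane, lowest destination in the lowest bits) is an assumption based on the conventional imm8 layout, so the
instruction set reference should be consulted for the exact encoding. Write masking is omitted.

```python
def vpshufd_model(src, imm8):
    """Element shuffle: the same 4-element pattern is applied inside every 128-bit lane."""
    return [
        src[lane * 4 + ((imm8 >> (2 * elem)) & 3)]
        for lane in range(4)
        for elem in range(4)
    ]


def vpermf32x4_model(src, imm8):
    """Lane shuffle: each destination lane copies one whole source lane."""
    out = []
    for lane in range(4):
        source_lane = (imm8 >> (2 * lane)) & 3
        out.extend(src[source_lane * 4:source_lane * 4 + 4])
    return out


src = list(range(16))
print(vpshufd_model(src, 0x1B))       # reverse the elements within each lane
print(vpermf32x4_model(src, 0xF2))    # lane pattern {3302}
print(vpermf32x4_model(src, 0xE4))    # identity
```

Two index bits for each of four destinations account for the eight bits of immediate in both cases; the element shuffle
reuses its single pattern in every lane, while the lane shuffle moves whole 128-bit blocks.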


Do not confuse the swizzle and the shuffle element patterns; the shuffle can take an arbitrary pattern, whereas the 
swizzle is limited to eight predefined patterns. If you receive compiler or assembler errors regarding a data pattern 
swizzle, the two immediate checks you should perform are: 


• Verify that you have all register operands
• Verify that your swizzle combination pattern is legal


7.13 Memory and Vectors 


Section 2 introduced the major components of the Intel® Xeon Phi™ coprocessor's new vector architecture, including the
general form of the vector instructions. That section also introduced the difference between all-register operands to an
instruction and those instructions that include an implicit load by replacing the second source register with a memory
reference. This type of inline load is common among CISC architectures, and can allow for better code density than other
approaches. The material previously covered also introduced the idea of modifiers to the source arguments, where an
implicit load can do either a limited form of data conversion through swizzles or data replication through broadcasts.


This section expands upon the prior foundation by looking at the true vector load and store instructions, with their
extended capabilities beyond what the implicit load modifiers can support. Unusual vector load and store instructions,
such as pack and unpack, are presented with their exact behaviors. This section also details the memory alignment
restrictions that the Intel® Xeon Phi™ coprocessor places upon vector loads and stores of any form, as well as ways to
work around the alignment restrictions.


7.13.1 Load and Store Operations 


Implicit load operations on routine vector instructions, such as vaddps, allow you to select from a small set of data 
conversions or data replications. In contrast, the vector load instruction vmovaps allows specification of both, with a 
far larger set of data conversions supported. The basic form of the vector load and store instruction is: 


Store instruction: vmovaps mt {k1}, Df32(zmm1) 
Load instruction: vmovaps zmm1 {k1}, Uf32(mt)


Table 7-4 summarizes the range of choices in the optional (conversion) and (broadcast) fields. 


Table 7-4. The vmovaps Instruction Support for Data Conversion and Data Replication Modifiers 


Broadcasts Supported:   1to16, 4to16, 16to16

Conversions Supported:

To        From
float32   srgb8, sint8, uint8, snorm8, unorm8, uint16, sint16, unorm16, snorm16, float16, unorm10A, unorm10B,
          float11C
sint32    sint8i, sint16i
uint32    uint8i, uint16i

More information about the forms can be found in various public specifications, as well as the Intel Architecture
manuals (Intel® 64 and IA-32 Architectures Software Developer Manuals).


Table 7-4 also shows that the memory storage form of sint16 can be converted to either float32 or sint32
destination types. Note that if the conversion target is a signed or unsigned integer, the conversion type has an i
suffix; therefore, uint8i and uint16i will convert from 8- or 16-bit unsigned integers into a 32-bit unsigned
integer. Similarly, sint8i and sint16i will convert from 8- or 16-bit signed integers into a 32-bit signed integer.
There is no direct conversion in the vmovaps instruction from unsigned to signed integer.


The float32 target supports a wide variety of conversion types, including the half-precision float16. Most of the
conversion types supported by the vector load instruction are in direct support of the various graphics standards. The
distinction between unorm conversions in the range 10A, 10B, 10C, and 2D has to do with which field of the packed data
format is converted (A, B, C, or D, respectively). Similarly, the float types 11A, 11B, and 10C differ in which
field is converted from the packed memory storage form (A, B, or C). The specific encoding rules for the various types
are summarized in Table 7-5.


Table 7-5. The Floating-Point Data Type Encodings Supported 


Type Encoding 
float16 s10e5 
float11 s6e5 
float10 s5e5 
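
As an illustration of how an encoding such as s10e5 is read, the Python sketch below decodes a 16-bit pattern with one
sign bit, five exponent bits, and ten mantissa bits. It assumes that float16 follows the IEEE 754 half-precision layout
with an exponent bias of 15, which Table 7-5 does not state explicitly; the function names are hypothetical, and the
result is cross-checked against Python's own half-precision support in the struct module.

```python
import struct


def decode_s10e5(bits):
    """Decode a 16-bit pattern: 1 sign bit, 5 exponent bits, 10 mantissa bits (bias 15 assumed)."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:                                   # zero or subnormal
        return sign * (mantissa / 1024.0) * 2.0 ** -14
    if exponent == 0x1F:                                # infinity or NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    return sign * (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)


def reference_half(bits):
    """Same decode using the standard library's IEEE half-precision format."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for bits in (0x3C00, 0xC000, 0x3555, 0x7BFF):
    print(hex(bits), decode_s10e5(bits), reference_half(bits))
```

The narrower float11 and float10 types in the table follow the same idea with fewer mantissa bits; their exact field
positions within the packed formats are not given here, so they are not modeled.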


When a write mask is used in conjunction with a vector load, only those elements that have their mask bit set to one are 
loaded. This is shown in Figure 7-21. 


[Figure: Mem, Mask, and Vector rows showing which elements of the vector a masked vector load updates from memory.]


Figure 7-21. The Behavior of the Vector Load Instruction With a Write Mask 


When a write mask is used in conjunction with a vector store, only those elements that have their mask bit set to one 
are stored. This is shown in Figure 7-22. 


Figure 7-22. The Behavior of the Vector Store Instruction With a Write Mask


Where the mask bit of an element is set to zero, the corresponding vector element remains unchanged. For example, if a
vector load/store is performed with a mask k1 set to 0xFF00, then only the upper eight elements will be loaded from
memory, starting at an offset of eight elements from the given load/store address.


All load/store operations must conform to the memory alignment requirements. Failure to use conforming addresses
will trigger a general-protection fault.
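

The write-mask rule can be illustrated with a scalar software model. The sketch below is not coprocessor code; the function name masked_load16 is hypothetical, and it only mirrors the rule that element i is loaded when bit i of the mask is one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (an illustration, not an intrinsic) of a write-masked
 * 16-element vector load. Element i of dst is replaced by src[i] only
 * when bit i of mask is 1; every other element keeps its old value. */
void masked_load16(float dst[16], const float src[16], uint16_t mask)
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 16; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = src[i];
        }
    }
}
```

Calling masked_load16 with a mask of 0xFF00 replaces only elements 8 through 15, which corresponds to the example in the text.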


The following broadcast load instructions are supported:


•  VBROADCASTF32X4
   Broadcast 4xfloat32 vector Uf32(mt) into vector zmm1, under writemask. The 4, 8, or 16 bytes (depending on
   the conversion and broadcast in effect) at memory address mt are broadcast and/or converted to a float32
   vector. The result is written into float32 vector zmm1.

•  VBROADCASTF64X4
   Broadcast 4xfloat64 vector Uf64(mt) into vector zmm1, under writemask. The 32 bytes at memory address mt
   are broadcast to a float64 vector. The result is written into float64 vector zmm1.

•  VBROADCASTSS
   Broadcast float32 vector Uf32(mt) into vector zmm1, under writemask. The 1, 2, or 4 bytes (depending on the
   conversion and broadcast in effect) at memory address mt are broadcast and/or converted to a float32
   vector. The result is written into float32 vector zmm1.

•  VBROADCASTSD
   Broadcast float64 vector Uf64(mt) into vector zmm1, under writemask. The 8 bytes at memory address mt are
   broadcast to a float64 vector. The result is written into float64 vector zmm1.
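

As a rough illustration of what a broadcast load produces, the following scalar model replicates one element (as VBROADCASTSS does) or a group of four elements (as VBROADCASTF32X4 does) across a 16-element float32 destination under a write mask. The function names are hypothetical, and data conversion is left out of the model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a 1-to-16 broadcast: one float32 from memory is written to
 * every destination element whose mask bit is set. */
void model_broadcast_1to16(float dst[16], const float *src, uint16_t mask)
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 16; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = src[0];
        }
    }
}

/* Model of a 4-to-16 broadcast: the four source elements repeat in each
 * group of four destination elements, again under the write mask. */
void model_broadcast_4to16(float dst[16], const float src[4], uint16_t mask)
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 16; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = src[i % 4];
        }
    }
}
```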


7.13.2 Alignment 


Regardless of which vector load or store form is used, all memory-based operations for vectors must be on properly
aligned addresses. Because the Intel® Xeon Phi™ coprocessor's vector architecture supports memory conversions as well as
broadcast and subset modifiers on the data elements, address alignment can be confusing at first. A simple means of
evaluating the alignment requirement is to use an equation:


alignment = number_elements × size_element


The number_elements term for vector load and store operations is set by the broadcast or subset modifier to the
instruction and is one of 1, 4, or 16. The size_element term is based on the memory storage form, whether loading or
storing. Since the Intel® Xeon Phi™ coprocessor treats all data internally as 32-bit values, only the representation in
memory determines what alignment (and how much cache bandwidth) is required. For example, with float16 data,
each element is two bytes in size in memory storage form. Table 7-6 summarizes the address alignment requirements.
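

The alignment rule (number of elements multiplied by the element size in memory storage form) is easy to evaluate in code. The helper names below are hypothetical and serve only as a sketch for checking an address before issuing a vector load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Required alignment in bytes: the element count selected by the
 * broadcast/subset modifier (1, 4, or 16) times the size in bytes of
 * one element in its memory storage form. */
size_t required_alignment(unsigned number_elements, size_t size_element)
{
    assert(number_elements == 1 || number_elements == 4 || number_elements == 16);
    assert(size_element > 0);
    return (size_t)number_elements * size_element;
}

/* Nonzero when address is a multiple of alignment. */
int is_aligned(uintptr_t address, size_t alignment)
{
    assert(alignment > 0);
    return (address % alignment) == 0;
}
```

For example, a full 16-element load of float32 data needs 16 × 4 = 64-byte alignment, while a 4-element form of float16 data needs 4 × 2 = 8 bytes.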


Table 7-6. Memory Alignment Requirements for Load and Store Operations


Memory Storage Form    Number of Elements            Address Alignment in Bytes
                       (Load Form | Store Form)

                       16to16 | all                  64 bytes
                                                     4 bytes
                       16to16
                       dcba                          8 bytes
                       16to16


The vector write mask is not consulted during memory alignment decisions; only the broadcast/subset modifier affects
the total alignment constraints in addresses. A representative example for forcing address alignments in C or C++ source
programs is shown in Figure 7-23.


__declspec(align(64))           // force 64-byte alignment
struct Matrices
{
    float x[3 * DATASIZE];      // Vector input
    float A[9 * DATASIZE];      // 3x3 Matrix
    float r[3 * DATASIZE];      // Vector result
} Matrices[64];


Figure 7-23. Compiler Extension to Force Memory Alignment Boundaries in C or C++ Tools 


There is a corner case in the address alignment calculation where number_elements as a term is always considered equal to
one. This case reduces the alignment constraint to the size of an element in memory storage form. This corner case
applies to the vector pack and vector unpack instructions and to the vector gather and vector scatter instructions.


7.13.3 Packing and Unpacking Operations 


There is another form of vector load and store operation that the Intel® Xeon Phi™ coprocessor presents: the pack 
(store) and unpack (load) instructions (alternatively known as compress and expand). The premise behind these 
constructs is that memory is always read contiguously, but that the contents read or written with respect to the vector 
register are not contiguous. In order to achieve this, the vector mask register is used in a slightly different manner than 
the write mask would normally be used. The vector unpack instruction behavior is shown in Figure 7-24. 


Figure 7-24. The Behavior of the Vector Unpack Instruction


The normal vector load operation reads all 16 elements from memory and overwrites the destination based on the 
mask, thereby skipping source elements in the serial sense. The unpack instruction keeps the serial ordering of elements 
and writes them sparsely into the destination. 


The first element (A) is read from memory, and the mask register is scanned to find the first bit that is set to the value 1,
in a right-to-left ordering. The destination vector element that corresponds to this first set mask bit is overwritten with the
value loaded from memory (A).


The second element (B) is read from memory, and the scan of the mask register bits resumes from where it left off. The
second mask bit that is set to the value 1 has a corresponding destination element value that is overwritten with the
value from memory (B).


This process is continued until all of the elements that are masked "on" have their values replaced with a contiguous
read from memory. The alternate name of this instruction, expand, reflects the nature of the operation: expanding
contiguous memory elements into noncontiguous vector elements.
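

The unpack (expand) walk described above can be sketched as a scalar loop. This is a software model under the assumption that mask bit 0 corresponds to vector element 0 (the rightmost element); the name model_unpack16 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of unpack/expand: memory is read contiguously, and each
 * value read goes to the next destination element whose mask bit is set.
 * Returns the number of elements consumed from memory. */
size_t model_unpack16(float dst[16], const float *mem, uint16_t mask)
{
    size_t next = 0;                 /* index of the next contiguous memory element */
    assert(dst != NULL && mem != NULL);
    for (int i = 0; i < 16; i++) {   /* scan mask bits from bit 0 upward */
        if ((mask >> i) & 1u) {
            dst[i] = mem[next++];
        }
    }
    return next;
}
```

With a mask of 0x0112 (bits 1, 4, and 8 set), the first three memory elements land in destination elements 1, 4, and 8, and all other elements are left unchanged.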


The equivalent instruction in vector store form is the pack instruction, shown in Figure 7-25. 


Figure 7-25. The Behavior of the Vector Pack Instruction
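

The pack (compress) direction is the mirror image of unpack and can be modeled the same way. The sketch below is a scalar illustration with a hypothetical function name, assuming mask bit 0 corresponds to vector element 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of pack/compress: vector elements whose mask bit is set
 * are written to memory one after another, with no gaps.
 * Returns the number of elements written. */
size_t model_pack16(float *mem, const float src[16], uint16_t mask)
{
    size_t next = 0;                 /* index of the next contiguous memory slot */
    assert(mem != NULL && src != NULL);
    for (int i = 0; i < 16; i++) {
        if ((mask >> i) & 1u) {
            mem[next++] = src[i];
        }
    }
    return next;
}
```

Only as many memory elements as there are set mask bits are written; memory beyond that point is not touched by the model.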


Conceptually, these instructions are fairly straightforward in that memory is always read or written contiguously while 
the vector elements are used only if their corresponding mask bit is set. In actual implementation, however, these 
abstract instructions complicate the real vector pack and unpack instruction specification. Since these instructions are 
working with memory on an element-size basis (based on the memory storage form), the address alignment 
requirement does not include a factor for the number of elements being read or written. A challenge in implementation 
arises if the write mask is set to a pattern that would cause a starting address to walk across a cache line boundary. 


The Intel® Xeon Phi™ coprocessor uses two instructions to create the pack or unpack instructions. One instruction
operates on the lower address cache line while the second instruction operates on the higher address cache line. They
generate a dependency chain with long latencies (7 cycles per instruction) because they typically use the same
destination register. Thus, vector unpack is actually composed of two instructions:


vloadunpackld zmm1 [{k1}], ptr [{conversion}]


vloadunpackhd zmm1 [{k1}], ptr+64 [{conversion}]


The order of specifying these two instructions is irrelevant. While the write mask is optional, the vloadunpackhd
version of the instruction requires you to specify +64 to the same memory address (see footnote 1). This requirement stems from
another microarchitectural complexity in preserving power- and area-efficiency while automatically calculating the
address.


The unpack instruction can take a data conversion modifier, supporting the same choices that vloadd does. However,
the unpack instructions cannot take a broadcast modifier; they are restricted to data conversion modifiers.


1 The +64 requirement does not change the expected behavior, as it is only applied to the "high" cache line operation. This is a 
requirement that stems from reducing the microarchitectural complexity. Programmers should overlook the presence of the +64 
accordingly, though failure to include it will produce incorrect results. 


As with the conceptual unpack instruction, the vector pack instruction is actually composed of two instructions: ld and
hd. This pair of instructions takes the same conversion modifiers as the vector store instruction:


vpackstoreld ptr [{k1}], zmm1 [{conversion}]
vpackstorehd ptr+64 [{k1}], zmm1 [{conversion}]


By careful use of 64-byte aligned buffers in memory, it is possible for an optimized code sequence to execute just one of
these instructions for pack or unpack. However, a common program bug is introduced when a
previously aligned buffer becomes unaligned. Therefore, you should always use both the hd and ld instructions, even
when you believe it may not be required.


A problem with using two instructions to achieve one conceptual result (whether pack or unpack) is that there is no
guarantee that the pair executes atomically. A coherence problem (data race) may appear, with incorrect results, if the
data changes between the execution of the ld and hd instructions. This would be a programmer error for failing to
wrap the pack or unpack instructions in a critical section.


The vector pack and unpack instructions are highly versatile and useful, and they also provide a way to relax the memory
alignment constraints. Where vloadd and vstored require addresses to be aligned, the pack and unpack instructions
only require alignment to the memory storage form. As long as the address to load from is aligned on a boundary for the
memory storage form, then executing the pair of ld and hd instructions loads the full 16 elements (with optional
conversions) given a write mask of 0xFFFF.


This technique bypasses the stricter alignment requirement, but it is limited in that it only provides identical results to the vector
load and store operations when the write mask is set to 0xFFFF. If the mask does not have every bit set, then the
result of the two-instruction sequence will have either data elements in the vector or data elements in memory
reordered relative to the standard load and store instructions.
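

The division of work between the ld and hd halves can be modeled in scalar C. The sketch below rests on assumptions not stated in this guide: a 64-byte cache line, float32 data with no conversion, and a model in which the ld half handles the elements that fall before the first cache-line boundary after the address while the hd half (given address + 64) handles the elements at or beyond it. All names are hypothetical, and mem is taken to be a cache-line-aligned buffer indexed by byte offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u

/* Walk the set mask bits in order; the n-th selected element lives at
 * byte offset addr + 4*n. Load it only if it lies on the requested side
 * (low or high) of the first cache-line boundary after addr. */
static void unpack_half(float dst[16], uint16_t mask,
                        const unsigned char *mem, size_t addr, int want_high)
{
    size_t boundary = (addr / CACHE_LINE + 1) * CACHE_LINE;
    size_t pos = addr;
    assert(addr % sizeof(float) == 0);   /* element-size alignment only */
    for (int i = 0; i < 16; i++) {
        if (!((mask >> i) & 1u)) {
            continue;
        }
        int is_high = pos >= boundary;
        if (is_high == want_high) {
            memcpy(&dst[i], mem + pos, sizeof(float));
        }
        pos += sizeof(float);
    }
}

/* Model of the ld half: elements in the lower-address cache line. */
void model_loadunpackld(float dst[16], uint16_t mask,
                        const unsigned char *mem, size_t addr)
{
    unpack_half(dst, mask, mem, addr, 0);
}

/* Model of the hd half: called with addr + 64, as in the operand
 * convention shown in the text. */
void model_loadunpackhd(float dst[16], uint16_t mask,
                        const unsigned char *mem, size_t addr_plus_64)
{
    assert(addr_plus_64 >= CACHE_LINE);
    unpack_half(dst, mask, mem, addr_plus_64 - CACHE_LINE, 1);
}
```

In this model, an address that is already 64-byte aligned leaves nothing for the hd half to do, which matches the observation that aligned buffers could get by with one instruction. Issuing both halves with a mask of 0xFFFF reproduces an ordinary 16-element load from an address that is only element-aligned.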


7.13.4 Non-Temporal Data 


The non-temporal hint is one modifier that can be applied to every Intel® Xeon Phi™ coprocessor memory reference for
vector load or store operations, as well as to software prefetching instructions. The NT hint is specified in the same
convention as swizzle modifiers, denoted by {NT}. A typical example of the NT hint in conjunction with a regular
vmovaps instruction is:


vmovaps [rsi] {k5}, zmm29 {float16} {dcba} {NT}


Cache hierarchies are designed to optimize efficiency by using the locality of data (and instructions) in a temporal 
stream. When any given datum is operated on, it is highly likely that the same datum will be used again in the near 
future. When operating on streaming data, however, the expectation of temporal locality is broken. With many 
streaming applications, data is referenced once, then discarded. 


When mixing data that has temporal locality with data that does not have temporal locality, caches become polluted in 
the sense that their efficiency is hampered. The cache replacement policy is what guides where old data is removed 
from the cache and new data is brought in. Policy management uses state machines and counters so that the cache 
implementation will track which data has been used most recently (MRU) and which data has been least recently used 
(LRU), with an ordered series of steps for each entry between these two points (LRU, MRU). 


When a cache is fully populated and a new datum needs to be inserted, the replacement policies center on the idea of
LRU management. In this case, the LRU datum is evicted from the cache, and the new datum is placed where the LRU
used to be. The tracking state for the MRU-LRU state is updated, so that the newest datum is now MRU and every other
datum decreases in temporal locality by one step. This, in turn, creates a new LRU entry.


The pollution comes from streaming data making this MRU-LRU state tracking less effective. In essence, streaming data
is non-temporal in that there is no expectation for rapid re-use. When streaming data is loaded into a typical cache
system, it automatically becomes MRU and displaces the last LRU data. When many streaming references are accessed
at once, this causes significant eviction of data that may be temporally useful, causing subsequent loads and stores to
stall due to cache misses.


To deal with this, the Intel® Xeon Phi™ coprocessor uses a non-temporal (NT) hint to override the default MRU-LRU state
machine. When any load or store operation is tagged with the NT hint, it automatically forces the datum in the cache to
become LRU; that is, next in line for eviction by subsequent operations. In the case of a software prefetch instruction,
however, the datum is tagged MRU so that it is more likely to be present in the cache on the subsequent load or store
operation for that datum.
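

The effect of the NT hint on replacement can be seen in a toy model. The sketch below assumes a single four-entry, fully associative set ordered from MRU to LRU; the structure, the set size, and the function name are illustrative assumptions and do not describe the coprocessor's actual cache organization.

```c
#include <assert.h>
#include <stddef.h>

#define WAYS 4

/* tag[0] is the MRU entry, tag[count - 1] is the LRU entry. */
typedef struct {
    int tag[WAYS];
    int count;
} lru_set;

/* Access one line. Returns 1 on a hit and 0 on a miss. A normal access
 * leaves the line at MRU; an access with the NT hint leaves it at LRU,
 * so it is the next line to be evicted. */
int cache_access(lru_set *s, int tag, int nt)
{
    int hit = 0;
    assert(s != NULL && s->count >= 0 && s->count <= WAYS);
    for (int i = 0; i < s->count; i++) {
        if (s->tag[i] == tag) {
            hit = 1;
            for (int j = i; j < s->count - 1; j++) {   /* remove the entry */
                s->tag[j] = s->tag[j + 1];
            }
            s->count--;
            break;
        }
    }
    if (!hit && s->count == WAYS) {
        s->count--;                                    /* evict the LRU entry */
    }
    if (nt) {
        s->tag[s->count] = tag;                        /* insert at LRU */
    } else {
        for (int j = s->count; j > 0; j--) {           /* insert at MRU */
            s->tag[j] = s->tag[j - 1];
        }
        s->tag[0] = tag;
    }
    s->count++;
    return hit;
}
```

In this model, streaming lines that are accessed with the NT hint only ever displace one another, so lines with temporal locality survive the stream; without the hint, the same stream pushes the reusable lines out.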


7.13.5 Prefetching 


Most modern microprocessors incorporate at least one hardware prefetching unit in the cache hierarchy to minimize 
the likelihood of misses in the L1 or L2 caches. The presence of hardware prefetchers typically increases the 
performance of a given workload by at least ten to fifteen percent for well-optimized programs, with some outliers 
demonstrating higher performance improvements. For programs that have not been extensively optimized and hand- 
tuned, a hardware prefetch unit is commonly capable of improving performance by fifty percent or more. Hardware 
prefetchers can be quite complex in their implementation, with more advanced implementations using adaptive tuning 
and predictive guesses for patterns observed in a series of cache accesses. This complexity requires a greater hardware 
design validation effort and post-silicon verification of behavior. 


The Intel® Xeon Phi™ coprocessor has a modified autonomous prefetching unit design that originated with the Intel®
Pentium® 4. The Intel® Xeon Phi™ coprocessor also introduces a variety of software prefetching instructions, enabling
you to tune your program’s behavior without interference from autonomous hardware prefetchers.
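

From C source, software prefetching is usually requested through a compiler facility rather than written as an instruction. The sketch below uses the GCC/Clang builtin __builtin_prefetch; which prefetch instruction (if any) a given compiler emits for it on a given target is an assumption to verify, and the function name and the distance parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array while requesting a prefetch of the element that will be
 * needed `distance` iterations ahead. The prefetch is only a hint and
 * does not change the result. */
float sum_with_prefetch(const float *data, size_t n, size_t distance)
{
    float total = 0.0f;
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (distance < n && i < n - distance) {
            /* second argument 0: prefetch for read;
             * third argument 0: no temporal locality expected */
            __builtin_prefetch(&data[i + distance], 0, 0);
        }
#endif
        total += data[i];
    }
    return total;
}
```

A suitable prefetch distance depends on memory latency and on the work done per iteration, so it is normally found by measurement.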


7.13.5.1 L1 Prefetching 


It is not possible to use the prefetch instructions directly on the L1 instruction cache. Software-controlled prefetching
for L1 can only directly access the data cache in the Intel® Xeon Phi™ coprocessor.


Table 7-7. L1 Prefetch Instructions


Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH0       L1            No             No
VPREFETCHNTA     L1            Yes            No
VPREFETCHE0      L1            No             Yes
VPREFETCHENTA    L1            Yes            Yes


The L1 prefetch instructions, each of which prefetches the memory line containing m8, are:


•  VPREFETCH0 m8
•  VPREFETCHNTA m8
•  VPREFETCHE0 m8
•  VPREFETCHENTA m8


The exclusive control bit tells the cache subsystem that the program is executing a data store to the target cache line.
This causes the coherence engine to ensure that the cache line is set to either the Exclusive or the Modified state in the
MESI protocol, fetching the data if it is not already there. The non-temporal (NT) control bit is processed; however, the
NT control with a prefetch always sets the LRU state machine to indicate MRU status so that the prefetched data is not
lost. The misshint (MH) control indicates that the target cache line is not expected to be resident in the cache, and that
the hardware scheduler should not reschedule this thread until the data has been loaded. This uses the same internal
mechanism as the delay instruction. Any valid bit-wise OR combination of these control bits (or none) may be set in
the hints field of the instruction.


This instruction is considered to be a microarchitecture hint, and the hardware may defer or drop execution. If the 
requested cache line containing the specified address is already in the L1 data cache, the prefetch is dropped. Similarly, 
any attempts to prefetch uncacheable or WC memory are ignored. 


7.13.5.2 L2 Prefetching 


The Intel® Xeon Phi™ coprocessor supports the software-controlled prefetching of code or data directly to the L2 cache.
The prefetching instructions for L2 are listed in Table 7-8.


Table 7-8. L2 Prefetch Instructions


Instruction      Cache Level   Non-temporal   Bring as exclusive
VPREFETCH1       L2            No             No
VPREFETCH2       L2            Yes            No
VPREFETCHE1      L2            No             Yes
VPREFETCHE2      L2            Yes            Yes


The exclusive control bit tells the cache subsystem that the program is executing a data store to the target cache line.
This causes the coherence engine to ensure that the cache line is set to either the Exclusive or the Modified state in the
MESI protocol, fetching the data if it is not already there. The non-temporal (NT) control bit is processed; however, the
NT control with a prefetch always sets the LRU state machine to indicate MRU status so that the prefetched data is not
lost. Any valid bit-wise OR combination of these control bits (or none) may be set in the hints field of the instruction.


This instruction is considered to be a microarchitecture hint, and the hardware may defer or drop execution. If the
requested cache line containing the specified address is already in the L2 data cache, the prefetch is dropped. Similarly,
any attempts to prefetch uncacheable or WC memory are ignored.


7.14 New Instructions 


Please refer to the Intel® Xeon Phi™ Coprocessor Instruction Set Reference Manual (Reference Number: 327364) for a
complete description of the supported instruction set. The following instructions are those added most recently.


7.14.1 Mask Manipulation Instructions 


The supported mask-manipulation instructions are listed in Table 7-9.


Table 7-9. Mask Manipulation Instructions


Instruction                Description
vkmov k2, [mem]            Loads 16 bits from a 16-bit aligned memory address into a mask register.
vkinsert k2, rax, imm8     Loads 16 bits from any quadrant of GPR rax specified by imm8.
vkextract rax, k2, imm8    Stores 16 bits into any quadrant of GPR rax specified by imm8, and the rest of rax is
                           set to zero.
vkmovhd rax, k2, k3        Stores 32 bits from mask registers into the higher half of the GPR rax and zeros the lower
                           half of the GPR rax.
vkmovld rax, k4, k5        Stores 32 bits from mask registers into the lower half of the GPR rax and zeros the higher half
                           of the GPR rax.
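

The quadrant selection used by vkinsert and vkextract can be sketched with shifts on a 64-bit value. This model assumes that quadrant 0 is the least significant 16 bits of the GPR and that the quadrant number is taken directly from imm8; both are assumptions, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the vkinsert idea: take the 16 bits of the selected
 * quadrant of a 64-bit GPR value as the new mask value. */
uint16_t model_vkinsert(uint64_t gpr, unsigned quadrant)
{
    assert(quadrant < 4);
    return (uint16_t)(gpr >> (16u * quadrant));
}

/* Model of the vkextract idea: place the 16 mask bits in the selected
 * quadrant of a 64-bit GPR value and zero everything else. */
uint64_t model_vkextract(uint16_t k, unsigned quadrant)
{
    assert(quadrant < 4);
    return (uint64_t)k << (16u * quadrant);
}
```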


7.14.2 Packed Typeless Instructions 


Table 7-10 lists the supported packed typeless instructions.


Table 7-10. Packed Typeless Instructions


Instruction                          Description

vstoredhi [mem] {k1}, v1             Stores the higher eight 32-bit elements from vector v1 to memory location
                                     [mem], under writemask k1.

vstoredlo [mem] {k1}, v1             Stores the lower eight 32-bit elements from vector v1 to memory location
                                     [mem], under writemask k1.

vpermd v0 {k1}, v2, v1/mem           v1/mem is the register or memory source data to be permuted, v2 holds the
                                     sixteen 4-bit indexes (the desired permuted locations), and v0 is the destination
                                     register. v2 can contain the same index, or the same permuted location, more
                                     than once. VPERMD is implemented using the generic swizzle mux.
                                     [Figure: example v1/mem and v2 contents]

valignd v0 {k1}, v2, v1/mem, imm8    v1/mem is the register or memory source data, v2 is the second register
                                     source, imm8 specifies the shift amount (bottom 2 bits are zeroed out for
                                     finer valign instructions in the future; only bits 5:2 are used), and v0 is the
                                     destination register.
                                     [Figure: v2 and v1/mem]

vscatterq [rax+v2] {k1}, v1          Quadword scatter of vector v1 to vector memory location [rax+v2], under
                                     writemask k1. It stores eight 64-bit elements from v1 to memory based on
                                     the lower eight 32-bit indexes in v2 given base address [rax].

vgatherq v1 {k1}, [rax+v2]           Quadword gather from memory location [rax+v2] into quadword vector
                                     v1, under writemask k1. It loads eight 64-bit elements from memory to v1
                                     based on the lower 256-bit (eight 32-bit) indexes in v2 given base address
                                     [rax].

vgetexppq v1 {k1}, Sf(v2/mt), immd8  Extracts int64 vector of exponents from float64 vector Sf(v2/mt) and
                                     stores the result in v1, under writemask. Float64 vector inputs are
                                     normalized according to immd8[1:0] and the sign of the exponents is
                                     returned as int64. If the source is SNaN, QNaN will be returned. If the
                                     source is +INF, source will be returned.
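

The index-driven permute can be sketched as a scalar loop. The model below assumes that each index in v2 selects the source element that is copied into the corresponding destination position, and that only the low four bits of each index are used; the direction of the mapping and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 16-element doubleword permute under a write mask:
 * v0[i] = v1[v2[i] & 0xF] for every i whose mask bit is set. */
void model_vpermd(uint32_t v0[16], const uint32_t v1[16],
                  const uint32_t v2[16], uint16_t k1)
{
    uint32_t tmp[16];   /* lets v0 alias v1 safely */
    assert(v0 != NULL && v1 != NULL && v2 != NULL);
    for (int i = 0; i < 16; i++) {
        tmp[i] = v1[v2[i] & 0xFu];
    }
    for (int i = 0; i < 16; i++) {
        if ((k1 >> i) & 1u) {
            v0[i] = tmp[i];
        }
    }
}
```

Because indexes may repeat, the same source element can appear in several destination positions; an index vector of all zeros acts as a broadcast of element 0.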


7.14.3 New Packed SP FP Instructions


The new packed single-precision floating-point instructions are:


•  vrcp23ps v1 {k1}, v0 // y = minimax quad approx recip(x)

•  vrsqrt23ps v1 {k1}, v0 // y = minimax quad approx rsqrt(x)

•  vlog2ps v1 {k1}, v0 // y = minimax quad approx log2(x)

•  vexp2ps v1 {k1}, v2 // y = minimax quad approx exp2(f)*2^I

•  vcvtps2pi v2 {k1}, v0, 0, 8 // t = convert to fixed point (x = I+f)

•  vabsdiffps
   Computes the absolute difference between two single-precision floating-point numbers by zeroing the sign bit of the
   result of a floating-point subtraction.

•  vgetmantps v1 {k1}, Sf(v2/mt), immd8
   Extracts float32 vector of mantissas from vector Sf(v2/mt) and stores the result in v1, under writemask. Vector
   float32 inputs are normalized according to immd8[1:0] and the sign of the mantissa results is set according to
   immd8[3:2].

•  VFIXUPNANPS
   Adds NaN source propagation.

7.14.4 New Packed Double-Precision Floating-Point Instructions 


Table 7-11 describes the most recently introduced double-precision floating-point instructions. 


Table 7-11. New Packed DP FP Instructions


Instruction                             Description

vroundpd v1 {k1}, v2/mem, RC, expadj    This instruction rounds float64 vector Sf(v2/mt) and stores the result in v1,
                                        using RC as rounding control, and expadj to optionally adjust the exponent
                                        before conversion to nearest integer or fixed-point value representation in
                                        float64, under writemask. VROUNDPD performs an element-by-element
                                        rounding of the result of the swizzle/broadcast/conversion from memory or
                                        float64 vector v2. The rounding result for each element is a float64 containing
                                        an integer or fixed-point value, depending on the value of expadj. The
                                        direction of rounding depends on the value of RC. The result is written into
                                        float64 vector v1. This instruction doesn’t actually convert the result to an
                                        int64; the results are float64s, just like the input, but are float64s containing
                                        the integer or fixed-point values that result from the specified rounding and
                                        scaling.

vclampzpd v1 {k1}, v2, Sf(v3/mt)        This instruction clamps float64 vector v2 between zero and float64 vector
                                        Sf(v3/mt) and stores the result in v1, under writemask. VCLAMPZPD performs
                                        an element-by-element clamp of float64 vector v2 to the range between zero
                                        and the float64 vector result of the swizzle/broadcast/conversion process on
                                        memory or float64 vector v3. The result is written into float64 vector v1. Note
                                        that this instruction behaves differently from VCLAMPZPD when the third
                                        argument is negative. The order of Max followed by Min is required for
                                        positive clamping values.

VFIXUPPD                                One with NaN propagation FIXUPNANPD; one without NaN propagation
                                        FIXUPPD.

vgetmantpd v1 {k1}, Sf(v2/mt), immd8    Extracts float64 vector of mantissas from vector Sf(v2/mt) and stores the
                                        result in v1, under writemask. Vector float64 inputs are normalized according
                                        to immd8[1:0] and the sign of the mantissa results is set according to
                                        immd8[3:2].


7.14.5 New Packed Int32 Instructions 


Table 7-12 describes the recently introduced int32 instructions. 


Table 7-12. Packed Int32 Instructions


Instruction                            Description

VABSDIFPI                              Determines the absolute difference between two 32-bit integer numbers either
                                       by leaving the result as is when the subtract result is positive, or by inverting and
                                       adding 1 to the result when the subtract result is negative.

vgetexppi v1 {k1}, Sf(v2/mt), immd8    Extracts int32 vector of exponents from float32 vector Sf(v2/mt) and stores the
                                       result in v1, under writemask. Float32 vector inputs are normalized according to
                                       immd8[1:0] and the sign of the exponents is returned as int32. If the source is
                                       SNaN, then QNaN will be returned. If the source is +INF, then the source will be
                                       returned.
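

The integer absolute-difference rule given for VABSDIFPI (keep the subtraction result when it is positive, otherwise invert it and add 1) can be written for a single element as follows. The function name is hypothetical, and returning the result as an unsigned 32-bit value is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model: subtract, and if the true result is negative,
 * invert the bits and add 1 (two's complement negation). */
uint32_t model_absdif_pi(int32_t a, int32_t b)
{
    uint32_t diff = (uint32_t)a - (uint32_t)b;   /* wraps modulo 2^32 */
    if (a < b) {
        diff = ~diff + 1u;                       /* negate the negative result */
    }
    /* cross-check against 64-bit arithmetic */
    assert(diff == (uint32_t)(a < b ? (int64_t)b - a : (int64_t)a - b));
    return diff;
}
```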


8 Glossary and Abbreviations


Term                Description

ABI                 Application Binary Interface
ACN                 Autonomous Compute Node
AGI                 Address Generation Interlock
AP                  Application Program
API                 Application Programming Interface
APIC                Advanced Programmable Interrupt Controller
BA                  Base Address
BLCR*               Berkeley Lab Checkpoint Restore
BMC                 Baseboard Management Controller
BSP                 Bootstrap Processor
CL*                 Open Computing Language
CLD                 Cache Line Disable
CMC                 Channel Memory Controller
COI                 Coprocessing Offload Infrastructure
CPI                 Carry-Propagate Instructions
CPU                 Central Processing Unit
C/R                 Check and Restore
CRI                 Core Ring Interface
C-state             Core idle state
CSR                 Configuration Status Register
DAPL                Direct Access Programming Library
DBS                 Demand-Based Scaling
DMA                 Direct Memory Access
DTD                 Distributed Tag Directory
DP                  Dual Precision
ECC                 Error Correction Code
EMU                 Extended Math Unit
EMON                Event Monitoring
ETC                 Elapsed Time Counter
FBox                Part of the GBox, the FBox is the interface to the ring interconnect.
FMA                 Fused Multiply and Add
FMS                 Fused Multiply Subtract
FPU                 Floating Point Unit
GBox                Memory controller
GDDR                Graphics Double Data Rate
GDDR5               Graphics Double Data Rate, version 5
GDT                 Global Descriptor Table
GOLS                Globally Owned, Locally Shared protocol
GP                  General Protection
HCA                 Host Channel Adaptor
HPC                 High Performance Computing
HPI                 Head Pointer Index
I2C                 ("i-squared cee" or "i-two cee") Inter-Integrated Circuit
IA                  Intel Architecture
IB                  InfiniBand*
IBHCA               InfiniBand* Host Communication Adapter
ID                  Identification
INVLPG              Invalidate TLB Entry
IPoIB               Internet Protocol over InfiniBand*
IPMI                Intelligent Platform Management Interface
iWARP               Internet Wide Area RDMA Protocol
Intel® MPSS         Intel® Manycore Platform Software Stack
I/O                 Input/Output
IOAPIC              Input/Output Advanced Programmable Interrupt Controller
ISA                 Instruction Set Architecture
LAPIC               Local Advanced Programmable Interrupt Controller
LKM                 Loadable Kernel Modules
LRU                 Least Recently Used
LSB                 Linux* Standard Base
MBox                The request scheduler of the GBox.
MCA                 Machine Check Architecture
MCE                 Machine Check Exception
MESI                Modified, Exclusive, Shared, Invalid states
MOESI               Modified, Owner, Exclusive, Shared, Invalid states
MKL                 Intel® Math Kernel Library
MMIO                Memory-Mapped Input/Output
MPI                 Message Passing Interface
MPSS                Intel® Many Integrated Core Architecture Platform Software Stack
MRU                 Most Recently Used
MSR                 Model-Specific Register or Machine-Specific Register
MTRR                Memory Type Range Register
mux                 multiplexor
MYO                 Intel® Mine Yours Ours Shared Virtual Memory
NFS                 Network File System
NT                  Non-Temporal
OpenCL*             Open Computing Language
OFA                 Open Fabrics Alliance
OFED*               Open Fabrics Enterprise Distribution
PC                  Power Control
PCH                 Platform Controller Hub
PCI Express*        Peripheral Component Interconnect Express
PCU                 Power Control Unit
PDE                 Page Directory Entry
PF                  Picker Function
PHP                 PHP scripts
P54C                Intel® Pentium® Processor
PM                  Power Management or Process Manager
PMON                Performance Monitoring
PMU                 Performance Monitoring Unit
PnP                 Plug and Play
POST                Power-On Self-Test
P-state             Performance level states
RAS                 Reliability Accessibility Serviceability
RDMA                Remote Direct Memory Access
RFO                 Read For Ownership
RMA                 Remote Memory Access
RS                  Ring Stack
SAE                 Suppress All Exceptions
SBox                System Box (Gen2 PCI Express* client logic)
SCIF                Symmetric Communication Interface
SC (SCM) protocol   Socket Connection Management
SDP                 Software Development Platform
SDV                 Software Development Vehicle
SEP                 SEP is a utility that provides the sampling functionality used by VTune analyzer
SHM                 Shared Memory
SI                  Intel® Xeon Phi™ Coprocessor System Interface
SIMD                Single Instructions, Multiple Data
SM                  Server Management
SMC                 System Management Controller
SMP                 Symmetric Multiprocessor
SMPT                System Memory Page Table
SP                  Single Precision
SSE                 Streaming SIMD Extensions
SSH                 Secure Shell
SVID                System V Interface Definition
Sysfs               A virtual file system provided by Linux*
TCU                 Transaction Control Unit
TPI                 Tail Pointer Index
TD                  Tag Directory
TLB                 Translation Lookaside Buffer
TMU                 Thermal Monitoring Unit
TSC                 Timestamp Counter
UC                  Uncacheable
uDAPL               User DAPL
coprocessor OS      Micro Operating System
verbs               A programming interface
VIA                 Virtual Interface Architecture
VMM                 Virtual Machine Manager
WP                  Write Protect
WT                  Write Through


Appendix: SBOX Control Register List


OC_I2C_ICR
OC_I2C_ISR
OC_I2C_ISAR
OC_I2C_IDBR
OC_I2C_IDMR
THERMAL_STATUS
THERMAL_INTERRUPT_ENABLE
MICROCONTROLLER_FAN_STATUS
STATUS_FAN1
STATUS_FAN2
SPEED_OVERRIDE_FAN
BOARD_TEMP1
BOARD_TEMP2
BOARD_VOLTAGE_SENSE
CURRENT_DIE_TEMP0
CURRENT_DIE_TEMP1
CURRENT_DIE_TEMP2
MAX_DIE_TEMP0


1000 


1004 


1008 


100C 


1010 


1018 


101C 


1020 


1024 


1028 


102C 


1030 


1000 


1004 


1008 


100C 


1010 


1018 


101C 


1020 


1024 


1028 


102C 


1030 


4096 


4100 


4104 


4108 


4112 


4120 


4124 


4128 


4132 


4136 


4140 


4144 


0400 


0401 


0402 


0403 


0404 


0406 32 


0407 32 


0408 32 


0409 32 
040A 32 


040B 32 


040C 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 
Ring 0 


Ring 0 


Ring 0 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 
Paging Yes 


Paging Yes 


Paging Yes 


Paging 


Paging 


Paging 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL 


RTL 


RTL 


RTL 


RTL 


PERST TRM,I 


2C 


PERST TRM,I 


2C 


PERST TRM,I 


2C 


PERST TRM,I 


2C 


PERST TRM,I 


2C 


PERST TRM 


Cep G TRM 
RPA 


PERST TRM 


PERST TRM 


PERST TRM 


I2C Control Register for coprocessor Over-clocking Unit
I2C Status Register for coprocessor Over-clocking Unit
I2C Slave Address Register for coprocessor Over-clocking Unit
I2C Data Buffer Register for coprocessor Over-clocking Unit
I2C Bus Monitor Register for coprocessor Over-clocking Unit
Status and Log info for all the thermal interrupts
Register that controls the interrupt response to thermal events
Up-to-date Status information from the Fan microcontroller
32 bit Status of Fan #1
32 bit Status of Fan #2


RTL CSR G TRM 32 bit Status of Fan #2 
RPA 
PERST TRM 


RTL 


Temperature from Sensors 1 and 2 on coprocessor card
Temperature from Sensors 3 and 4 on coprocessor card
Digitized value of Voltage sense input to coprocessor
Consists of Current Die Temperatures of sensors 0 thru 2
Consists of Current Die Temperatures of sensors 3 thru 5
Consists of Current Die Temperatures of sensors 6 thru 8
Consists of Maximum Die Temperatures of sensors 0 thru 2


MAX_DIE_TEMP1
MAX_DIE_TEMP2
MIN_DIE_TEMP0
MIN_DIE_TEMP1
MIN_DIE_TEMP2
THERMAL_CONSTANTS
THERMAL_TEST
GPU_HOT_CONFIG
NOM_PERF_MON
PMU_PERIOD_SEL
ELAPSED_TIME_LOW
ELAPSED_TIME_HIGH
THERMAL_STATUS_INTERRUPT
THERMAL_STATUS_2
THERMAL_TEST_2
EXT_TEMP_SETTINGS0
EXT_TEMP_SETTINGS1
EXT_TEMP_SETTINGS2
EXT_TEMP_SETTINGS3
EXT_TEMP_SETTINGS4
EXT_TEMP_SETTINGS5
EXT_CONTROLPARAMS0
EXT_CONTROLPARAMS1


104C 


1050 


1054 


1058 


105C 


1060 


1064 


1068 


106C 


1070 


1074 


1078 


107C 


1080 


1084 


1090 


104C 


1050 


1054 


1058 


105C 


1060 


1064 


1068 


106C 


1070 


1074 


1078 


107C 


1080 


1084 


1090 


4172 


4176 


4180 


4184 


4188 


4192 


4196 


4200 


4204 


4208 


4212 


4216 


4220 


4224 


4228 


4240 


0413 


0414 


0415 


0416 


0417 


0418 


0419 


041A 


041B 


041C 


041D 


041E 


041F 


0420 


0421 


0424 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging Yes Lock RTL 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging 


Paging 


Paging 


Paging 


able 
Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


RTL CSR_GRPA TRM


PERST TRM 


RTL PERST TRM 


RTL PERST TRM 


RTL PERST TRM 


RTL PERST TRM 


RTL PERST TRM 


RTL PERST TRM 


RTL CSR_GRPA TRM
RTL PERST TRM 


Consists of Maximum Die
Temperatures of sensors
3 thru 5


Consists of Maximum Die
Temperatures of sensors
6 thru 8


Consists of Minimum Die
Temperatures of sensors
0 thru 2


Consists of Minimum Die
Temperatures of sensors
3 thru 5


Consists of Minimum Die
Temperatures of sensors
6 thru 8


Constants that define 
thermal response 


System Interrupt Cause 
Set Register 1 


Configuration CSR for 
GPU HOT 

Nominal Performance 
Monitors 

PMU period 


"Elapsed Time Clock" 
Timer - lower 32 bits 


"Elapsed Time Clock" 
Timer - higher 32 bits 


Status and Log info for 
coprocessor new thermal 
interrupts 

Thermal Status for 
coprocessor 

Thermal Testability for 
coprocessor 

External Thermal Sensor 
Setting - Sensor #0 


External Thermal Sensor 
Setting - Sensor #1 


External Thermal Sensor 
Setting - Sensor #2 


External Thermal Sensor 
Setting - Sensor #3 


External Thermal Sensor 
Setting - Sensor #4 


External Thermal Sensor 
Setting - Sensor #5 


External Thermal Sensor
Parameters - Sensor #0


External Thermal Sensor 
Parameters - Sensor #1 
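
The "Elapsed Time Clock" timer described above is split across two 32-bit registers, one holding the lower 32 bits and one the higher 32 bits. The sketch below shows one way software might join the two halves into a single 64-bit count. The function names and the caller-supplied `read_low`/`read_high` callables are hypothetical, and the re-read loop (which guards against the lower half wrapping between the two reads) is an assumption; the table does not say whether the hardware latches the pair.

```python
from typing import Callable

MASK32 = 0xFFFFFFFF


def combine_elapsed_time(low: int, high: int) -> int:
    """Join the lower and higher 32-bit halves into one 64-bit count."""
    return ((high & MASK32) << 32) | (low & MASK32)


def read_elapsed_time(read_low: Callable[[], int],
                      read_high: Callable[[], int]) -> int:
    """Read both halves, retrying if the high half changed in between.

    The retry is an assumed precaution, not documented hardware behavior.
    """
    while True:
        high_before = read_high()
        low = read_low()
        high_after = read_high()
        if high_before == high_after:
            return combine_elapsed_time(low, high_after)
```

In a real driver the two callables would wrap 32-bit MMIO reads of the lower and higher timer registers.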
EXT_CONTROLPARAMS2


EXT_CONTROLPARAMS3


EXT_CONTROLPARAMS4


EXT_CONTROLPARAMS5


EXT_TEMP_STATUS0


EXT_TEMP_STATUS1


INT_FAN_STATUS


INT_FAN_CONTROL0


INT_FAN_CONTROL1


INT_FAN_CONTROL2


FAIL_SAFE_STATUS


FAIL_SAFE_OFFSET


SW_OVR_CORE_DISABLE


CORE_DISABLE_STATUS


FLASH_COMPONENT


INVALID_INSTR0


INVALID_INSTR1


JEDECID


VENDOR_COMP_CAPP


POWER_ON_STATUS


VALID_INSTR0


VALID_INSTR1


10B0 


10B4 


10B8 


10BC 


10C0 


10C4 


10C8 


10CC 


10D0 


10D4 


2000 


2004 


2008 


2010 


2018 


2020 


2024 


2030 


2034 


2038 


2040 


2044 


10B0 


10B4 


10B8 


10BC 


10C0


10C4 


10C8 


10CC 


10D0 


10D4 


2000 


2004 


2008 


2010 


2018 


2020 


2024 


2030 


2034 


2038 


2040 


2044 


4272 


4276 


4280 


4284 


4288 


4292 


4296 


4300 


4304 


4308 


8192 


8196 


8200 


8208 


8216 


8224 


8228 


8240 


8244 


8248 


8256 


8260 


042C 


042D 


042E 


042F 


0430 


0431 


0432 


0433 


0434 


0435 


0800 


0801 


0802 


0804 


0806 


0808 


0809 


080C 


080D 


080E 


0810 


0811 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging No 


Paging No 


Paging 


Paging 


Paging 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes RTL 


Yes RTL 


Yes FLASH 


Yes OTHER


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


TRM 


TRM 


TRM 


TRM 


TRM 


TRM 


PERST TRM 


PERST TRM 


TRM 


TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


External Thermal Sensor 
Parameters - Sensor #2 


External Thermal Sensor 
Parameters - Sensor #3 


External Thermal Sensor 
Parameters - Sensor #4 


External Thermal Sensor 
Parameters - Sensor #5 


External Thermal Sensor
Status - Sensor #0 ~ #2


External Thermal Sensor 
Status - Sensor #3 ~ #5 


Internal Thermal Sensor 
Status 

Internal Thermal Sensor 
Setting/Parameters and 
FCU Configuration - 0 


Internal Thermal Sensor 
Setting/Parameters and 
FCU Configuration - 1 


Internal Thermal Sensor 
Setting/Parameters and 
FCU Configuration - 2 


Fail Safe Image and 
Repair Status register 


Fail Safe Offset register 


Software controlled core 
Disable register that says 
how many cores are 
disabled - deprecated 


core Disable status 
register 

Flash Component 
register 

Invalid Instruction 
register 

Invalid Instruction 
register 

JEDEC ID register. This is 
a SW only register, SPI 
Controller reads these 
bits from the flash 
descriptor and reports 
the values in this 
register. 

Vendor Specific 
component capabilities 
register. This is a SW only 
register, SPI Controller 
reads these bits from the 
flash descriptor and 
reports the values in this 
register. 

Power On status register 


Scratch 


Scratch

VALID_INSTR2


VALID_INSTR_TYP0


VALID_INSTR_TYP1


VALID_INSTR_TYP2


HW_SEQ_STATUS


FLASH_ADDR


FLASH_DATA


SW_SEQ_STATUS


SW_SEQ_CTRL


OPCODE_TYP_CONFIG


OPCODE_MENU0


OPCODE_MENU1


2048 


2050 


2054 


2058 


2070 


2078 


2090 


20B0 


20B4 


20B8 


2048 


2050 


2054 


2058 


2070 


2078 


20AC 


20B0 


8264 


8272 


8276 


8280 


8304 


8312 


8336 


8368 


8372 


0812 


0814 


0815 


0816 


081C 


081E 


0824 


082C 


082D 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


8 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes FLASH 


Yes RTL 


Lockable RTL


Lockable RTL


Lockable Flash,RTL


Lockable RTL


Lockable RTL


Lockable RTL


Lockable RTL


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


PERST TRM 


TRM 


TRM 


TRM 


TRM 


Scratch 


Scratch 
Scratch 
Scratch 


HW Sequence Flash 
Status Register 


This is the starting byte 
linear address of a SPI 
read/write/erase 
command 


Flash Data registers 


SW Sequence Flash 
Status Register 


SW Sequence Flash 
Control Register 


This register specifies
information about the
type of opcode. Entries in
this register correspond to
the entries in the
Opcode Menu
Configuration register.


This register lists the
allowable opcodes.
Software programs an
SPI opcode into this field
for use when initiating
SPI commands through
the Control Register.
Flash opcodes that are
tagged as invalid via the
flash descriptor will
immediately assert the
PERR bit in the Hardware
Sequencing Flash Status
register.


This register lists the
allowable opcodes.
Software programs an
SPI opcode into this field
for use when initiating
SPI commands through
the Control Register.
Flash opcodes that are
tagged as invalid via the
flash descriptor will
immediately assert the
PERR bit in the Hardware
Sequencing Flash Status
register.

OPCODE_MENU2


FAIL_SAFE_REPAIR_OFFSET


AGENT_DISABLE_FLASH0


AGENT_DISABLE_FLASH1


AGENT_DISABLE_FLASH2


AGENT_DISABLE_FLASH3


AGENT_DISABLE_FLASH4


AGENT_DISABLE_FLASH5


SW_OVR_AGENT_DISABLE0


SW_OVR_AGENT_DISABLE1


SW_OVR_AGENT_DISABLE2


SW_OVR_AGENT_DISABLE3


SW_OVR_AGENT_DISABLE4


SW_OVR_AGENT_DISABLE5


SPI_FSM


SOFT_STRAP_0


Paging No Lockable


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging No Yes FLASH 


Paging No Lockable


PERST TRM 


PERST TRM 


CSR_GRPA TRM


This register lists the
allowable opcodes.
Software programs an
SPI opcode into this field
for use when initiating
SPI commands through
the Control Register.
Flash opcodes that are
tagged as invalid via the
flash descriptor will
immediately assert the
PERR bit in the Hardware
Sequencing Flash Status
register.


Fail Safe Offset for Repair 
Sector register 


Agent Disable Value from 
Flash register 


Agent Disable Value from 
Flash register 


Agent Disable Value from 
Flash register 


Agent Disable Value from 
Flash register 


Agent Disable Value from 
Flash register 


Agent Disable Value from 
Flash register 


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


Software controlled (and
TAP override) Agent
Disable register that says
how many agents are
disabled


SPI FSM Status register 


soft strap registers - 0

SOFT_STRAP_1


SOFT_STRAP_2


SOFT_STRAP_3


SOFT_STRAP_4


SOFT_STRAP_5


SOFT_STRAP_6


SOFT_STRAP_7


STARV_THRSH


TX_ARB_PRIOR


TX_ARB_STRV_PRIOR


TX_BURST_LEN


PCI_EP_DBG_CAPT


TX_ARB_FIX


GH_SCRATCH


EP_TX_CONTROL


DFTUR


SPARE


RX_HALT


MCX_CTL_LO


MCX_STATUS_LO


MCX_STATUS_HI


MCX_ADDR_LO


MCX_ADDR_HI


MCX_MISC


MCX_MISC2


SMPT00


SMPT01


SMPT02


SMPT03

2404 


2408 


240C 


2410 


2414 


2418 


241C 


3010 


3014 


3018 


301C 


3024 


3028 


303C 


3044 


304C 


3050 


306C 


3090 


3098 


309C 


30A0 


30A4 


30AC 


2404 


2408 


240C 


2410 


2414 


2418 


241C 


3010 


3014 


3018 


301C 


3024 


3028 


303C 


3044 


304C 


3050 


306C 


3090 


3098 


309C 


30A0 


30A4 


30A8 


30AC 


9220 


9224 


9228 


9232 


9236 


9240 


9244 


12304 


12308 


12312 


12316 


12324 


12328 


12348 


12356 


12364 


12368 


12396 


12432 


12440 


12444 


12448 


12452 


12456 


12460 


12544 


12548 


12552 


12556 


0901 


0902 


0903 


0904 


0905 


0906 


0907 


0C04 


0C05 


0C06 


0C07 


0C09 


0C0A


0C0F


0C11


0C13


0C14


0C1B


0C24 


0C26 


0C27 


0C28 


0C29 


0C2A 


0C2B 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging No Lockable
Paging No Lockable
Paging No Lockable
Paging No Lockable
Paging No Lockable
Paging No Lockable
Paging No Lockable
Paging No Yes 
Paging No Yes 


Paging No Yes 


Paging No Yes 
Paging No Yes 


Paging No Yes 


Paging No Yes 
Paging No Yes 
Paging No Yes 
Paging No Yes 


Paging No Yes 


Paging Yes Yes 
Paging Yes Yes 
Paging Yes Yes 
Paging Yes Yes 
Paging Yes Yes 


Paging No Yes 


Paging 


Paging Yes/No*


Paging 


Paging 


FLASH 


FLASH 


FLASH 


FLASH 


FLASH 


FLASH 


FLASH 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


CSR_GRPA TRM
CSR_GRPA TRM
CSR_GRPA TRM
CSR_GRPA TRM
CSR_GRPA TRM
CSR_GRPA TRM
CSR_GRPA TRM
PERST CRU,TRM


PERST CRU,TRM
CRU,TRM


PERST


PERST CRU,TRM
CRU,TRM
CRU,TRM


PERST


PERST


PERST CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM


PERST


PERST


PERST


PERST


PERST CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM


PERST 


PERST 


PERST 


PERST 


soft strap registers - 1 
soft strap registers - 2 
soft strap registers - 3 
soft strap registers - 4 
soft strap registers - 5 
soft strap registers - 6 
soft strap registers - 7 


Starvation thresholds per 
T2P FIFO. 


Transmit Arbiter 
Priorities 

Transmit Arbiter 
Starvation Priorities. 


Transmit Burst Length 
Per FIFO 


PCI endpoint debug 
capture 

Fixes Tx arbiter onto one 
selection 


Scratch register 


Endpoint Transmit 
Control 
Chicken Bits for GHost. 


Spares for H/W 
debug/fixes. 

Halts reception of all PCI 
packets. 


MCX CTL LOW 
MCX Status 
MCX Status HI 
MCX Addr Low 
MCX Addr High 


Machine Check 
Miscellaneous #1 


Machine Check 
Miscellaneous #2 


System Memory Page
Table, Page 00. *Not
Host accessible on A-step.

System Memory Page 
Table, Page 01. 


System Memory Page 
Table, Page 02. 


System Memory Page
Table, Page 03.

SMPT04


SMPT05


SMPT06


SMPT07


SMPT08


SMPT09


SMPT10


SMPT11


SMPT12


SMPT13


SMPT14


SMPT15


SMPT16


SMPT17


SMPT18


SMPT19


SMPT20


SMPT21


SMPT22


SMPT23


SMPT24


SMPT25


SMPT26


SMPT27


SMPT28


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


System Memory Page 
Table, Page 04. 


System Memory Page 
Table, Page 05. 


System Memory Page 
Table, Page 06. 


System Memory Page 
Table, Page 07. 


System Memory Page 
Table, Page 08. 


System Memory Page 
Table, Page 09. 


System Memory Page 
Table, Page 10. 


System Memory Page 
Table, Page 11. 


System Memory Page 
Table, Page 12. 


System Memory Page 
Table, Page 13. 


System Memory Page 
Table, Page 14. 


System Memory Page 
Table, Page 15. 


System Memory Page 
Table, Page 16. 


System Memory Page 
Table, Page 17. 


System Memory Page 
Table, Page 18. 


System Memory Page 
Table, Page 19. 


System Memory Page 
Table, Page 20. 


System Memory Page 
Table, Page 21. 


System Memory Page 
Table, Page 22. 


System Memory Page 
Table, Page 23. 


System Memory Page 
Table, Page 24. 


System Memory Page 
Table, Page 25. 


System Memory Page 
Table, Page 26. 


System Memory Page 
Table, Page 27. 


System Memory Page 
Table, Page 28. 
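
The System Memory Page Table entries appear to sit at a regular 4-byte stride: the decimal offsets listed elsewhere in this table for pages 00 to 03 are 12544, 12548, 12552 and 12556, and the hexadecimal offsets listed for pages 29 to 31 are 3174, 3178 and 317C. The sketch below computes an entry's offset under the assumption that the stride is uniform for all 32 pages. The constant and function names are hypothetical, not taken from the source.

```python
SMPT_BASE = 0x3100   # decimal 12544, the offset listed for page 00
SMPT_STRIDE = 4      # one 32-bit register per page (assumed uniform)
SMPT_COUNT = 32      # pages 00 through 31


def smpt_offset(page: int) -> int:
    """Return the byte offset of the SMPT register for the given page."""
    if not 0 <= page < SMPT_COUNT:
        raise ValueError("page must be in the range 0..31")
    return SMPT_BASE + SMPT_STRIDE * page
```

For instance, `smpt_offset(29)` evaluates to `0x3174`, which matches the offset the table lists for page 29.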
SMPT29


SMPT30


SMPT31


PDAT


SDAT


DATOUT


DATIN0


TAP_IDCODE


TAP_SUBSTEP


CR_RX_BUF_STS


CR_RX_LFSR_FTS


CR_P_CONSUMED


CR_NP_CONSUMED


CR_CPL_CONSUMED


CR_P_LIMIT


CR_NP_LIMIT


CR_CPL_LIMIT


CR_P_ALLOCATED


CR_NP_ALLOCATED


CR_CPL_ALLOCATED


PSMI_ERR_STATUS


PSMI_CONFIG
PSMI_SEQ_NU


REAL_TIME_CL


3174 


3178 


317C 


3200 


3208 


3210 


3218 


3A00 


3A04 


3A08 


3A0C 


3A10 


3A14 


3A18 


3A1C 


3A20 


3A24 


3A28 


3A2C 


3A30 


3A34 


3A38 


3A3C 


3A44 


4010 


4014 


3174 


3178 


317C 


3200 


3208 


3210 


3218 


3A00 


3A04 


3A08 


3A0C 


3A10 


3A14 


3A18 


3A1C 


3A20 


3A24 


3A28 


3A2C 


3A30 


3A34 


3A38 


3A3C 


3A44 


4010 


4014 


12660 


12664 


12668 


12800 


12808 


12816 


12824 


14848 


14852 


14856 


14860 


14864 


14868 


14872 


14876 


14880 


14884 


14888 


14892 


14896 


14900 


14904 


14908 


14916 


16400 


16404 


0C5D


0C5E


0C5F


0C80


0C82


0C84


0C86


0E80


0E81


0E82


0E83


0E84


0E85


0E86


0E87


0E88


0E89


0E8A


0E8B


0E8C


0E8D


0E8E


0E8F


0E91


1004 


1005 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes/No*


Paging Yes/No*


Paging Yes/No*


Paging Yes Lockable RTL


Paging Yes Lockable RTL


Paging Yes Lockable RTL


Paging Yes Lock RTL 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging 


Paging 


Paging 


Paging 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


able 
Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


PERST CRU,TRM


PERST CRU,TRM


PERST CRU,TRM


PERST CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM
CRU,TRM


PERST


PERST


PERST


PERST


PERST


PERST


PERST


PERST


PERST CRU,TRM
PERST CRU,TRM
PERST CRU,TRM
CRU,TRM


PERST


PERST CRU,TRM
PERST CRU,TRM
PERST CRU,TRM
PERST CRU,TRM
PERST CRU,TRM
CRU,TRM
CRU,TRM


PERST 


PERST 


PERST TRM 


PERST TRM 


System Memory Page 
Table, Page 29. 


System Memory Page 
Table, Page 30. 


System Memory Page 
Table, Page 31. 


LDAT Primary DAT 
LDAT Secondary DAT 
LDAT data out 

LDAT data in 

TAP IDCODE 

TAP substepping 
VCO buffer status 

RX LFSR N EIS 


CREDIT CONSUMED 
COUNTER POSTED 


CREDIT CONSUMED 
COUNTER NON-POSTED 


CREDIT CONSUMED 
COUNTER COMPLETION 


CREDIT LIMIT COUNTER 
POSTED 

CREDIT LIMIT COUNTER 
NON-POSTED 


CREDIT LIMIT COUNTER 
COMPLETION 


CREDIT ALLOCATED 
COUNTER POSTED 


CREDIT ALLOCATED 
COUNTER NON-POSTED 


CREDIT ALLOCATED 
COUNTER COMPLETION 


PSMI ERROR STATUS 


PSMI CONFIG 


TX/RX SEQ NUM


Sample of real time clock 
from the endpoint 


Reset Global Control 


Device Status Register

PWR_TIMEOUT


RDBCTL


RDBSTAT


CurrentRatio


IccOverClock0


IccOverClock1


IccOverClock2


IccOverClock3


COREFREQ


COREVOLT


MEMORYFREQ


MEMVOLT


SVIDControl


PCUControl


HostPMState


HOSPMState


C3WakeUp_Timer


L1_Entry_Timer


C3_Timers


coprocessor OS PCUCONTROL
SVIDSTATUS


COMPONENTID


GBOXPMControl


4018 


4020 


4024 


402C 


4040 


4044 


4048 


404C 


4100 


4104 


4108 


410C 


4110 


4114 


4118 


411C 


4120 


4124 


4128 


412C 


4130 


4134 


413C 


4018 


4020 


4024 


402C 


4040 


4044 


4048 


404C 


4100 


4104 


4108 


410C 


4110 


4114 


4118 


411C 


4120 


4124 


4128 


412C 


4130 


4134 


413C 


16408 


16416 


16420 


16428 


16448 


16452 


16456 


16460 


16640 


16644 


16648 


16652 


16656 


16660 


16664 


16668 


16672 


16676 


16680 


16684 


16688 


16692 


16700 


1006 


1008 


1009 


100B 


1010 


1011 


1012 


1013 


1040 


1041 


1042 


1043 


1044 


1045 


1046 


1047 


1048 


1049 


104A 


104B 


104C 


104D 


104F 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging 


Paging 


Paging 


Paging 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL PERST TRM 


RTL CSR G TRM 
RPA 

RTL CSR G TRM 
RPA 

RTL PERST TRM 


TRM 


TRM 


TRM 


TRM 


RTL PERST SNARF
SNARF
SNARF
SNARF
SNARF


RTL PERST


RTL PERST


RTL PERST


RTL PERST


SNARF


RTL PERST,


SNARF


RTL PERST,


SNARF


RTL PERST,


RTL PERST SNARF


RTL,OTHER PERST,HOT SNARF


RTL,OTHER PERST SNARF

RTL PERST,HOT SNARF

RTL PERST SNARF

SNARF

SNARF


PERST 


RTL PERST, 


Timeout value used by
the reset engine to
time out various external
reset events: Slot
Power, GrpBPwrGd
assertion, and Connector
status timeout period.
The number N in this
register is used to shift 1
by N places. N must be
less than 32.

Reset Debug Control 
Register 

Reset Debug Status 
Register 

The expected MCLK 
Ratio that is sent to the 
corepll 

core Overclocking Only, 
protected by 
overclocking disable fuse 
(OverclockDis) 

Mem Overclocking Only, 
protected by 
overclocking disable fuse 
(OverclockDis) 

Display Bend1, Always 
open, no fuse protection 


Display Bend2, Always 
open, no fuse protection 


Core Frequency 
Core Voltage 
Memory Frequency 
Memory Voltage 


SVID VR12/MVP7 Control 
Interface Register 


Power Control Unit 
Register 


Host PM scratch 
registers 


coprocessor OS PM 
Scratch registers 


C3 Wakeup Timer 
Control for autoC3 


L1 Entry Timer 


C3 Entry and Exit Timers

coprocessor OS PCU
Control CSR, i.e. not for
host consumption

SVID Status 
COMPONENTID 


GBOX PM Control 
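
The PWR TIMEOUT description above says the register value N is used to shift 1 by N places, with N less than 32. A minimal sketch of that arithmetic follows, assuming the shift is a left shift (so the timeout count is a power of two). The function name is hypothetical, and the unit of the resulting count is not stated in the table.

```python
def pwr_timeout_count(n: int) -> int:
    """Return 1 shifted left by n places; n must be in the range 0..31.

    The left-shift direction is an assumption based on the description.
    """
    if not 0 <= n < 32:
        raise ValueError("N must be in the range 0..31")
    return 1 << n
```

Rejecting values of 32 or more mirrors the stated limit, since the result would no longer fit in a 32-bit register.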
GPIO_Input_Status


GPIO_Output_Control
EMON_Control


EMON_Counter0

PCIE_VENDOR_ID_DEVICE_ID


PCIE_PCI_COMMAND_AND_STATUS
PCIE_PCI_REVISION_ID_AND_C_0X8
PCIE_PCI_CACHE_LINE_SEL_0XC

PCIE_MEMORY_BAR_0
PCIE_UPPER_DWORD_OF_MEMOR_0X14
PCIE_IO_BAR_2


PCIE_MEMORY_BAR_1

PCIE_UPPER_DWORD_OF_MEMOR_0X24
PCIE_PCI_SUBSYSTEM

PCIE_EXPANSION_ROM_BAR


PCIE_PCI_CAPABILITIES_POINTER

PCIE_PCI_INTERRUPT_LINE_PIN


PCIE_PCI_PM_CAPABILITY
PCIE_PM_STATUS_AND_CONTRO_0X48

PCIE_PCIE_CAPABILITY

PCIE_DEVICE_CAPABILITY

PCIE_DEVICE_CONTROL_AND_STATUS

PCIE_LINK_CAPABILITY

PCIE_LINK_CONTROL_AND_STA_0X5C
PCIE_DEVICE_CAPABILITY_2


PCIE_DEVICE_CONTROL_AND_S_0X74
PCIE_LINK_CONTROL_AND_STATUS_2


4140 


4144 


4160 


4164 


5800 


5808 


580C 


5810 


5814 


5818 


5820 


5824 


582C 


5830 


5834 


583C 


5844 


5848 


584C 


5850 


5854 


5858 


585C 


5870 


5874 


587C 


4140 


4144 


4160 


4164 


5800 


5804 


5808 


580C 


5810 


5814 


5818 


5820 


5824 


582C 


5830 


5834 


583C 


5844 


5848 


584C 


5850 


5854 


5858 


585C 


5870 


5874 


587C 


16704 


16708 


16736 


16740 


22528 


22532 


22536 


22540 


22544 


22548 


22552 


22560 


22564 


22572 


22576 


22580 


22588 


22596 


22600 


22604 


22608 


22612 


22616 


22620 


22640 


22644 


22652 


1050 


1051 


1058 


1059 


1600 


1601 


1602 


1603 


1604 


1605 


1606 


1608 


1609 


160B 


160C 


160D 


160F 


1611 


1612 


1613 


1614 


1615 


1616 


1617 


161C 


161D 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL 


RTL 


RTL 


RTL 


PERST, 
OTHER 


PERST 


PERST 


PERST 


SNAR 
F 


SNAR 
F 
SNAR 
F 
SNAR 
F 


GPIO Input Status 


GPIO Output Control 
EMON Control Register 


EMON Counter 0 
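
The offset columns in this table appear to give the same location in three forms: a hexadecimal byte offset, a decimal byte offset, and a hexadecimal index counted in 32-bit (dword) units. For example, the GPIO Input Status entry is listed as 4140, 16704 and 1050. The sketch below reproduces that relationship; the function names are hypothetical and the column interpretation is inferred from the listed values rather than stated in the source.

```python
def dword_offset(byte_offset: int) -> int:
    """Convert a 4-byte-aligned byte offset into a 32-bit word index."""
    if byte_offset < 0 or byte_offset % 4 != 0:
        raise ValueError("byte offset must be non-negative and 4-byte aligned")
    return byte_offset // 4


def offset_columns(byte_offset: int) -> str:
    """Format hex byte offset, decimal byte offset and hex dword index."""
    return f"{byte_offset:04X} {byte_offset} {dword_offset(byte_offset):04X}"
```

Such a helper can be used to cross-check rows where extraction has damaged one of the three columns.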
PCIE_MSI_CAPABILITY 5888 5888 22664 1622 32 1 Ring 0 Paging No Yes

PCIE_MESSAGE_ADDRESS 588C 588C 22668 1623 32 1 Ring 0 Paging No Yes

PCIE_MESSAGE_UPPER_ADDRESS 5890 5890 22672 1624 32 1 Ring 0 Paging No

PCIE_MESSAGE_DATA 5894 5894 22676 1625 32 1 Ring 0 Paging No Yes

PCIE_MSIX_CAPABILITY 5898 5898 22680 1626 32 1 Ring 0 Paging No Yes

PCIE_MSIX_TABLE_OFFSET_BIR 589C 589C 22684 1627 32 1 Ring 0 Paging No

PCIE_PBA_OFFSET_BIR 58A0 58A0 22688 1628 32 1 Ring 0 Paging No Yes

PCIE_ADVANCED_ERROR_CAPABILITY 5900 5900 22784 1640 32 1 Ring 0 Paging No

PCIE_UNCORRECTABLE_ERROR_0X104 5904 5904 22788

PCIE_UNCORRECTABLE_ERROR_MASK 22792

PCIE_UNCORRECTABLE_ERROR_0X10C 590C 22796

PCIE_CORRECTABLE_ERROR_STATUS 22800 Paging

PCIE_CORRECTABLE_ERROR_MASK 5914 22804 Paging

PCIE_ADVANCED_ERROR_CAPA_0X118 22808 Paging

PCIE_ERROR_HEADER_LOG_DWORD_0 591C 22812 Paging

PCIE_ERROR_HEADER_LOG_DWORD_1 22816 Paging

PCIE_ERROR_HEADER_LOG_DWORD_2 5924 22820 Paging

PCIE_ERROR_HEADER_LOG_DWORD_3 22824 Paging

PCIE_LTSSM_STATE_CONTROL 5C00 5C00 23552 1700 32 1 Ring 0 Paging No Yes

PCIE_LTSSM_STATE_STATUS 5C04 5C04 23556 1701 32 1 Ring 0 Paging No Yes

PCIE_SKIP_FREQUENCY_TIMER 5C08 5C08 23560 1702 32 1 Ring 0 Paging No

PCIE_LANE_SELECT 5C0C 5C0C 23564 1703 32 1 Ring 0 Paging No Yes

PCIE_LANE_DESKEW 5C10 5C10 23568 1704 32 1 Ring 0 Paging No Yes

PCIE_RECEIVER_ERROR_STATUS 5C14 5C14 23572 1705 32 1 Ring 0 Paging No

PCIE_LANE_NUMBER_CONTROL 5C18 5C18 23576 1706 32 1 Ring 0 Paging No

PCIE_N_FTS_CONTROL 5C1C 5C1C 23580 1707 32 1 Ring 0 Paging No Yes

PCIE_LINK_STATUS 5C20 5C20 23584 1708 32 1 Ring 0 Paging No Yes

PCIE_SYNC_BYPASS 5C2C 5C2C 23596 170B 32 1 Ring 0 Paging No Yes

PCIE_ACK_REPLAY_TIMEOUT 5C38 5C38 23608 170E 32 1 Ring 0 Paging No

PCIE_SEQUENCE_NUMBER_STATUS 5C3C 5C3C 23612 170F 32 1 Ring 0 Paging No

PCIE_GPEX_PM_TIMER 5C50 5C50 23632 1714 32 1 Ring 0 Paging No Yes

PCIE_PME_TIMEOUT 5C54 5C54 23636 1715 32 1 Ring 0 Paging No Yes

PCIE_ASPM_L1_TIMER 5C58 5C58 23640 1716 32 1 Ring 0 Paging No Yes

PCIE_ASPM_REQUEST_TIMER 5C5C 5C5C 23644 1 Ring 0 Paging No

PCIE_ASPM_L1_DISABLE 5C60 5C60 23648 1718 32 1 Ring 0 Paging No Yes

PCIE_ADVISORY_ERROR_CONTROL 5C68 5C68 23656 171A 32 1 Ring 0 Paging No

PCIE_GPEX_ID 5C70 5C70 23664 171C 32 1 Ring 0 Paging No Yes

PCIE_GPEX_CLASSCODE 5C74 5C74 23668 171D 32 1 Ring 0 Paging No Yes

PCIE_GPEX_SUBSYSTEM_ID 5C78 5C78 23672 1 Ring 0 Paging No

PCIE_GPEX_DEVICE_CAPABILITY 5C7C 5C7C 23676 171F 32 1 Ring 0 Paging No

PCIE_GPEX_LINK_CAPABILITY 5C80 5C80 23680 1720 32 1 Ring 0 Paging No

PCIE_GPEX_PM_CAPABILITY 5C88 5C88 23688 1722 32 1 Ring 0 Paging No Yes

PCIE_GPEX_LINK_CONTROL_STATUS 5C9C 5C9C 23708 1727 32 1 Ring 0 Paging No

PCIE_ERROR_COUNTER 5CAC 5CAC 23724 172B 32 1 Ring 0 Paging No Yes

PCIE_CONFIGURATION_READY 5CB0 5CB0 23728 172C 32 1 Ring 0 Paging No

PCIE_FC_UPDATE_TIMEOUT 5CB8 5CB8 23736 172E 32 1 Ring 0 Paging No

PCIE_FC_UPDATE_TIMER 5CBC 5CBC 23740 172F 32 1 Ring 0 Paging No Yes

PCIE_LOAD_VC_BUFFER_SIZE 5CC8 5CC8 23752 1732 32 1 Ring 0 Paging No

PCIE_VC_BUFFER_SIZE_THRESHOLD 5CCC 5CCC 23756 1733 32 1 Ring 0 Paging No

PCIE_VC_BUFFER_SELECT 5CD0 5CD0 23760 1734 32 1 Ring 0 Paging No Yes

PCIE_BAR_ENABLE 5CD4 5CD4 23764 1735 32 1 Ring 0 Paging No Yes

PCIE_BAR_SIZE_LOWER_DWORD 5CD8 5CD8 23768 1736 32 1 Ring 0 Paging No

PCIE_BAR_SIZE_UPPER_DWORD 5CDC 5CDC 23772 1737 32 1 Ring 0 Paging No

PCIE_BAR_SELECT

PCIE_CREDIT_COUNTER_SELECT

PCIE_CREDIT_COUNTER_STATUS
PCIE_TLP_HEADER_SELECT
PCIE_TLP_HEADER_LOG_DWORD0
PCIE_TLP_HEADER_LOG_DWORD1
PCIE_TLP_HEADER_LOG_DWORD2
PCIE_TLP_HEADER_LOG_DWORD3
PCIE_RELAXED_ORDERING_CONTROL
PCIE_BAR_PREFETCH
PCIE_FC_CHECK_CONTROL


PCIE_FC_UPDATE_TIMER_TRAFFIC


PCIE_UNCORRECTABLE_ERROR_0X5F0
PCIE_CLOCK_GATING_CONTROL
PCIE_GEN2_CONTROL_CSR


PCIE_GPEX_IP_RELEASE_VERSION

AFEBND0_CFG0


AFEBND1_CFG0


AFEBND2_CFG0


AFEBND3_CFG0


AFEBND4_CFG0


AFEBND5_CFG0


AFEBND6_CFG0


AFEBND7_CFG0


5CE0


5CE4 


5CE8 


5CEC 


5D00 


5D04 


5D08 


5D18 


5CE0


5CE4 


5CE8 


5CEC 


5CF0


5D04 


5D08 


5D18 


23776 


23780 


23784 


23788 


23792 


23796 


23800 


23804 


23808 


23812 


23816 


23832 


1738 


1739 


173A 


173B 


173C 


1741 


1742 


1746 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging 


Paging 


Paging 


Paging 


Paging No 


Paging No 


Paging No 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Yes 


Yes 


Yes 


Yes 


AFE Bundle Config 
Register 0, Bundle 0, 
0x6000 

AFE Bundle Config 
Register 0, Bundle 1, 
0x6004 

AFE Bundle Config 
Register 0, Bundle 2, 
0x6008 

AFE Bundle Config 
Register 0, Bundle 3, 
0x600C

AFE Bundle Config 
Register 0, Bundle 4, 
0x6010 

AFE Bundle Config 
Register 0, Bundle 5, 
0x6014 

AFE Bundle Config 
Register 0, Bundle 6, 
0x6018 

AFE Bundle Config 
Register 0, Bundle 7, 
0x601C

AFEBND0_CFG1


AFEBND1_CFG1


AFEBND2_CFG1


AFEBND3_CFG1


AFEBND4_CFG1


AFEBND5_CFG1


AFEBND6_CFG1


AFEBND7_CFG1


AFEBND0_CFG2


AFEBND1_CFG2


AFEBND2_CFG2


AFEBND3_CFG2


AFEBND4_CFG2


AFEBND5_CFG2


AFEBND6_CFG2


AFEBND7_CFG2


AFEBND0_CFG3


AFEBND1_CFG3


AFEBND2_CFG3


AFEBND3_CFG3


AFEBND4_CFG3


AFEBND5_CFG3


AFEBND6_CFG3


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


AFE Bundle Config 
Register 1, Bundle 0, 
0x6020 

AFE Bundle Config 
Register 1, Bundle 1, 
0x6024 

AFE Bundle Config 
Register 1, Bundle 2, 
0x6028 

AFE Bundle Config 
Register 1, Bundle 3, 
0x602C 

AFE Bundle Config 
Register 1, Bundle 4, 
0x6030 

AFE Bundle Config 
Register 1, Bundle 5, 
0x6034 

AFE Bundle Config 
Register 1, Bundle 6, 
0x6038 

AFE Bundle Config 
Register 1, Bundle 7, 
0x603C 

AFE Bundle Config 
Register 2, Bundle 0, 
0x6040 

AFE Bundle Config 
Register 2, Bundle 1, 
0x6044 

AFE Bundle Config 
Register 2, Bundle 2, 
0x6048 

AFE Bundle Config 
Register 2, Bundle 3, 
0x604C 

AFE Bundle Config 
Register 2, Bundle 4, 
0x6050 

AFE Bundle Config 
Register 2, Bundle 5, 
0x6054 

AFE Bundle Config 
Register 2, Bundle 6, 
0x6058 

AFE Bundle Config 
Register 2, Bundle 7, 
0x605C 

AFE Bundle Config 
Register 3, Bundle 0, 
0x6060 

AFE Bundle Config 
Register 3, Bundle 1, 
0x6064 

AFE Bundle Config 
Register 3, Bundle 2, 
0x6068 

AFE Bundle Config 
Register 3, Bundle 3, 
0x606C 

AFE Bundle Config 
Register 3, Bundle 4, 
0x6070 

AFE Bundle Config 
Register 3, Bundle 5, 
0x6074 

AFE Bundle Config 
Register 3, Bundle 6, 
0x6078 
AFEBND7_CFG3


AFEBND0_CFG4


AFEBND1_CFG4


AFEBND2_CFG4


AFEBND3_CFG4


AFEBND4_CFG4


AFEBND5_CFG4


AFEBND6_CFG4


AFEBND7_CFG4


AFEBND0_LBC5


AFEBND1_LBC5


AFEBND2_LBC5


AFEBND3_LBC5


AFEBND4_LBC5


AFEBND5_LBC5


AFEBND6_LBC5


AFEBND7_LBC5


AFELN0_CFG0


AFELN1_CFG0


AFELN2_CFG0


AFELN3_CFG0


AFELN4_CFG0


AFELN5_CFG0


607C 


6080 


6084 


6088 


608C 


6090 


6094 


6098 


609C 


60A0 


60A4 


60A8 


60AC 


60B0 


60B4 


60B8 


60BC 


60C0 


60C4 


60C8 


60CC 


60D0 


60D4 


607C 


6080 


6084 


6088 


608C 


6090 


6094 


6098 


609C 


60A0 


60A4 


60A8 


60AC 


60B0 


60B4 


60B8 


60BC 


60C0


60C4 


60C8 


60CC 


60D0 


60D4 


24700 


24704 


24708 


24712 


24716 


24720 


24724 


24728 


24732 


24736 


24740 


24744 


24748 


24752 


24756 


24760 


24764 


24768 


24772 


24776 


24780 


24784 


24788 


181F 


1820 


1821 


1822 


1823 


1824 


1825 


1826 


1827 


1828 


1829 


182A 


182B 


182C 


182D 


182E 


182F 


1830 


1831 


1832 


1833 


1834 


1835 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging No 


Paging No 


Paging No 


Paging No 


Yes 


Yes 


AFE Bundle Config 
Register 3, Bundle 7, 
0x607C 

AFE Bundle Config 
Register 4, Bundle 0, 
0x6080 

AFE Bundle Config 
Register 4, Bundle 1, 
0x6084 

AFE Bundle Config 
Register 4, Bundle 2, 
0x6088 

AFE Bundle Config 
Register 4, Bundle 3, 
0x608C 

AFE Bundle Config 
Register 4, Bundle 4, 
0x6090 

AFE Bundle Config 
Register 4, Bundle 5, 
0x6094 

AFE Bundle Config 
Register 4, Bundle 6, 
0x6098 

AFE Bundle Config 
Register 4, Bundle 7, 
0x609C 

AFE Bundle Load Bus 
Control Register 5, 
Bundle 0, 0x60A0 
AFE Bundle Load Bus 
Control Register 5, 
Bundle 1, 0x60A4 
AFE Bundle Load Bus 
Control Register 5, 
Bundle 2, 0x60A8 
AFE Bundle Load Bus 
Control Register 5, 
Bundle 3, 0x60AC
AFE Bundle Load Bus
Control Register 5,
Bundle 4, 0x60B0
AFE Bundle Load Bus 
Control Register 5, 
Bundle 5, 0x60B4 
AFE Bundle Load Bus 
Control Register 5, 
Bundle 6, 0x60B8 
AFE Bundle Load Bus 
Control Register 5, 
Bundle 7, 0x60BC
AFE Lane Config Register
0, Lane 0, 0x60C0


AFE Lane Config Register 
0, Lane 1, 0x60C4 


AFE Lane Config Register 
0, Lane 2, 0x60C8 


AFE Lane Config Register 
0, Lane 3, 0x60CC


AFE Lane Config Register
0, Lane 4, 0x60D0


AFE Lane Config Register 
0, Lane 5, 0x60D4 
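
The AFE Bundle entries listed above follow a regular pattern: Register 0, Bundle 0 sits at 0x6000, each register index advances the offset by 0x20, and each bundle within a register by 4 bytes. The Python sketch below reproduces that arithmetic. The function and constant names are hypothetical, and the strides are inferred from the listed offsets rather than stated by the table, so treat it as a cross-check aid only.

```python
# Hypothetical helper: base and strides are inferred from the offsets in
# the table (0x6000, 0x6020, ..., 0x60BC); they are not taken from any
# official header.
AFE_BUNDLE_BASE = 0x6000      # AFE Bundle Config Register 0, Bundle 0
AFE_BUNDLE_REG_STRIDE = 0x20  # 8 bundles x 4 bytes per register index
AFE_BUNDLE_STRIDE = 0x4       # one 32-bit register per bundle


def afe_bundle_offset(reg: int, bundle: int) -> int:
    """Return the byte offset of bundle register `reg` for `bundle`.

    Index 5 is the "AFE Bundle Load Bus Control Register 5" in the table.
    """
    if not 0 <= reg <= 5:
        raise ValueError("register index must be in 0..5")
    if not 0 <= bundle <= 7:
        raise ValueError("bundle index must be in 0..7")
    return AFE_BUNDLE_BASE + reg * AFE_BUNDLE_REG_STRIDE + bundle * AFE_BUNDLE_STRIDE


if __name__ == "__main__":
    # Register 4, Bundle 7 is listed at 0x609C.
    print(hex(afe_bundle_offset(4, 7)))
```

Checking a computed value against the listed one is a quick way to spot offsets that were damaged in transcription.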
AFELN6_CFG0


AFELN7_CFG0


AFELN8_CFG0


AFELN9_CFG0


AFELN10_CFG0


AFELN11_CFG0


AFELN12_CFG0


AFELN13_CFG0


AFELN14_CFG0


AFELN15_CFG0


AFELN0_CFG1


AFELN1_CFG1


AFELN2_CFG1


AFELN3_CFG1


AFELN4_CFG1


AFELN5_CFG1


AFELN6_CFG1


AFELN7_CFG1


AFELN8_CFG1


AFELN9_CFG1


AFELN10_CFG1


AFELN11_CFG1


AFELN12_CFG1


AFELN13_CFG1


AFELN14_CFG1


60D8 


60DC 


60E0 


60E4 


60E8 


60EC 


60F0 


60F4 


60F8 


60FC 


6100 


6104 


6108 


610C 


6110 


6114 


6118 


611C 


6120 


6124 


6128 


612C 


6130 


6134 


6138 


60D8 


60DC 


60E0 


60E4 


60E8 


60EC 


60F0


60F4 


60F8 


60FC 


6100 


6104 


6108 


610C 


6110 


6114 


6118 


611C 


6120 


6124 


6128 


612C 


6130 


6134 


6138 


24792 


24796 


24800 


24804 


24808 


24812 


24816 


24820 


24824 


24828 


24832 


24836 


24840 


24844 


24848 


24852 


24856 


24860 


24864 


24868 


24872 


24876 


24880 


24884 


24888 


1836 


1837 


1838 


1839 


183A 


183B 


183C 


183D 


183E 


183F 


1840 


1841 


1842 


1843 


1844 


1845 


1846 


1847 


1848 


1849 


184A 


184B 


184C 


184D 


184E 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


AFE Lane Config Register 
0, Lane 6, 0x60D8 


AFE Lane Config Register 
0, Lane 7, 0x60DC 


AFE Lane Config Register 
0, Lane 8, 0x60E0


AFE Lane Config Register
0, Lane 9, 0x60E4


AFE Lane Config Register
0, Lane 10, 0x60E8


AFE Lane Config Register
0, Lane 11, 0x60EC


AFE Lane Config Register
0, Lane 12, 0x60F0


AFE Lane Config Register
0, Lane 13, 0x60F4


AFE Lane Config Register
0, Lane 14, 0x60F8


AFE Lane Config Register
0, Lane 15, 0x60FC


AFE Lane Config Register 
1, Lane 0, 0x6100 


AFE Lane Config Register 
1, Lane 1, 0x6104 


AFE Lane Config Register 
1, Lane 2, 0x6108 


AFE Lane Config Register 
1, Lane 3, 0x610C 


AFE Lane Config Register 
1, Lane 4, 0x6110 


AFE Lane Config Register 
1, Lane 5, 0x6114 


AFE Lane Config Register 
1, Lane 6, 0x6118 


AFE Lane Config Register 
1, Lane 7, 0x611C 


AFE Lane Config Register 
1, Lane 8, 0x6120 


AFE Lane Config Register 
1, Lane 9, 0x6124 


AFE Lane Config Register 
1, Lane 10, 0x6128 


AFE Lane Config Register 
1, Lane 11, 0x612C 


AFE Lane Config Register 
1, Lane 12, 0x6130 


AFE Lane Config Register 
1, Lane 13, 0x6134 


AFE Lane Config Register 
1, Lane 14, 0x6138 
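
The AFE Lane Config entries are laid out the same way as the bundle registers, with 16 lanes per register index: Register 0, Lane 0 is listed at 0x60C0 and Register 1, Lane 0 at 0x6100, so the register stride appears to be 0x40 and the lane stride 4 bytes. A small Python sketch under those assumptions follows. The names are hypothetical, and the upper register index of 2 is assumed from the AFELNn_CFG2 register names.

```python
# Hypothetical helper: base and strides are inferred from the listed
# offsets (0x60C0 for Register 0 Lane 0, 0x6100 for Register 1 Lane 0).
AFE_LANE_BASE = 0x60C0       # AFE Lane Config Register 0, Lane 0
AFE_LANE_REG_STRIDE = 0x40   # 16 lanes x 4 bytes per register index
AFE_LANE_STRIDE = 0x4        # one 32-bit register per lane


def afe_lane_offset(reg: int, lane: int) -> int:
    """Return the byte offset of lane config register `reg` for `lane`."""
    if not 0 <= reg <= 2:
        raise ValueError("register index must be in 0..2")
    if not 0 <= lane <= 15:
        raise ValueError("lane index must be in 0..15")
    return AFE_LANE_BASE + reg * AFE_LANE_REG_STRIDE + lane * AFE_LANE_STRIDE


if __name__ == "__main__":
    # Register 1, Lane 14 is listed at 0x6138.
    print(hex(afe_lane_offset(1, 14)))
```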
AFELN15_CFG1


AFELN0_CFG2


AFELN1_CFG2


AFELN2_CFG2


AFELN3_CFG2


AFELN4_CFG2


AFELN5_CFG2


AFELN6_CFG2


AFELN7_CFG2


AFELN8_CFG2


AFELN9_CFG2


AFELN10_CFG2


AFELN11_CFG2


AFELN12_CFG2


AFELN13_CFG2


AFELN14_CFG2


AFELN15_CFG2


AFECMN_CFG0


AFECMN_CFG1 


AFECMN_CFG2 


AFECMN_CFG3 


AFECMN_CFG4 


AFECMN_CFG5 


AFECMN_CFG6 


AFECMN_CFG7 


613C 


6140 


6144 


6148 


614C 


6150 


6154 


6158 


615C 


6160 


6164 


6168 


616C 


6170 


6174 


6178 


617C 


6180 


6184 


6188 


618C 


6190 


6194 


6198 


619C 


613C 


6140 


6144 


6148 


614C 


6150 


6154 


6158 


615C 


6160 


6164 


6168 


616C 


6170 


6174 


6178 


617C 


6180 


6184 


6188 


618C 


6190 


6194 


6198 


619C 


24892 


24896 


24900 


24904 


24908 


24912 


24916 


24920 


24924 


24928 


24932 


24936 


24940 


24944 


24948 


24952 


24956 


24960 


24964 


24968 


24972 


24976 


24980 


24984 


24988 


184F 


1850 


1851 


1852 


1853 


1854 


1855 


1856 


1857 


1858 
AFE Lane Config Register
2, Lane 1, 0x6144
2, Lane 2, 0x6148
2, Lane 4, 0x6150
AFE Lane Config Register
2, Lane 8, 0x6160
2, Lane 14, 0x6178
Register 7, 0x619C
AFE Read Status Register
0, 0x61AC
0x6214
Register, 0x621c


PCS PSMI Status register,
Configuration Register,
Configuration Register,
Configuration Register,
Configuration Register,
Register 1, 0x6240
Register 5, Address -
0x6314

LPIVBPHY Trace Bus Data
41412


284C 


2850 


2851 


2852 


2853 


2854 


2855 


2856 


2857 


2858 


2859 


285A 


285B 


285C 


2860 


2861 


2862 


2863 


2864 


2865 


2866 


2867 


2868 


2869 


286A 


286B 


286C 


2870 


2871 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL STICKY 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL 


RTL STICKY_B
RTL STICKY 


RTL 


RTL 


RTL 


RTL 


RTL 
RTL 


RTL STICKY_B
RTL STICKY 


RTL 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


DMA Channel Error 
Register 

DMA Channel Attributes 
Register 


Descriptor Head Pointer 
Register 


Descriptor Tail Pointer 
Register 
DMA Auxiliary Register 0


DMA Auxiliary Register 0


Descriptor Ring 
Attributes Register 


Descriptor Ring 
Attributes Register 


DMA Interrupt Timer 
Register 

DMA Channel Status 
Register 

DMA Tail Pointer Write 
Back Register Lo 


DMA Tail Pointer Write 
Back Register 


DMA Channel Error 
Register 

DMA Channel Error 
Register 

DMA Channel Attributes 
Register 


Descriptor Head Pointer 
Register 


Descriptor Tail Pointer 
Register 
DMA Auxiliary Register 0


DMA Auxiliary Register 0


Descriptor Ring 
Attributes Register 


Descriptor Ring 
Attributes Register 


DMA Interrupt Timer 
Register 

DMA Channel Status 
Register 

DMA Tail Pointer Write 
Back Register Lo 


DMA Tail Pointer Write 
Back Register 


DMA Channel Error 
Register 

DMA Channel Error 
Register 

DMA Channel Attributes 
Register 


Descriptor Head Pointer 
Register 
DTPR_7


DAUX_LO_7


DAUX_HI_7


DRAR_LO_7


DRAR_HI_7


DITR_7


DMA_DSTAT_7


DTPWBR_LO_7


DTPWBR_HI_7


DCHERR_7
DCHERRMSK_7
DMA_REQUEST_SIZE
DCR


DQDR_TL


DQDR_TR


DQDR_BL


DQDR_BR


DMA_SPARE0


TRANSLIMIT


DMA_SPARE2


DMA_MISC


CP_MEMREGION_BASE


CP_MEMREGION_TOP

PSMIA_0

PSMIA_1

PSMIA_2


PSMIA_3


PSMIA_4


A1C8 


A1CC 


A1D0


A1D4 


A1D8 


A1DC 


A1E0 


A1E4 


A1E8 


A1EC 


A1F0


A200 


A280 


A284 


A288 


A28C 


A294 


A298 


A29C 


A2A0 


A2A4 


A2B0


A2B4 


A3A0 


A3A4 


A3A8 


A3AC 


A3B0


A1C8 


A1CC 


A1D0


A1D4 


A1D8 


A1DC 


A1E0 


A1E4 


A1E8 


A1EC 


A1F0


A200 


A280 


A284 


A288 


A28C 


A290 


A294 


A298 


A29C 


A2A0 


A2A4 


A2B0


A2B4 


A3A0 


A3A4 


A3A8 


A3AC 


A3B0


41416 


41420 


41424 


41428 


41432 


41436 


41440 


41444 


41448 


41452 


41456 


41472 


41600 


41604 


41608 


41612 


41616 


41620 


41624 


41628 


41632 


41636 


41648 


41652 


41888 


41892 


41896 


41900 


41904 


2872 


2873 


2874 


2875 


2876 


2877 


2878 


2879 


287A 


287B 


287C 


2880 


28A0 


28A1 


28A2 


28A3 


28A4 


28A5 


28A6 


28A7 


28A8 


28A9 


28AC 


28AD 


28E8 


28E9 


28EA 


28EB 


28EC 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging 


Paging 


Paging 


Paging 


Paging 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL 


RTL 


RTL 
RTL 


RTL STICKY_B
RTL STICKY_B
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB
RTL CSR_GRPB


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


Descriptor Tail Pointer 
Register 
DMA Auxiliary Register 0


DMA Auxiliary Register 0


Descriptor Ring 
Attributes Register 


Descriptor Ring 
Attributes Register 


DMA Interrupt Timer 
Register 

DMA Channel Status 
Register 

DMA Tail Pointer Write 
Back Register Lo 


DMA Tail Pointer Write 
Back Register 


DMA Channel Error 
Register 

DMA Channel Error 
Register 

DMA Request control 


DMA Configuration 
Register 

Descriptor Queue Access 
Register 


Descriptor Queue Data 
Register Top Left 


Descriptor Queue Data 
Register Top Right 


Descriptor Queue Data 
Register Bottom Left 


Descriptor Queue Data 
Register Bottom Right 


Spare DMA register full 
32-bits available 


TPE Attributes Register 


Spare DMA register full 
32-bits available 


Misc bits such as chicken 
bits -- etc... 


Base Address: Address at 
4K granularity 


Memory Region Length 
(4K) 

PSMI register 

PSMI register 

PSMI register 


PSMI register 


PSMI register 
PSMIA_5


PSMIA_6


PSMIA_7


PSMIB_0


PSMIB_1


PSMIB_2


PSMIB_3


PSMIB_4


PSMIB_5


PSMIB_6


PSMIB_7


PSMIC_0


PSMIC_1


PSMIC_2


PSMIC_3


PSMIC_4


PSMIC_5


PSMIC_6


PSMIC_7


DMA_LOCK


APICIDR


APICVER


APICAPR


APICRT


APICICR


MCA_INT_STAT


MCA_INT_EN


SCRATCH


PSMI_MEMSHADOW_CNTRL


PSMI_EN


PSMI_TIM_CNTRL

CONCAT_CORE_HALTED


A3B4 


A3B8 


A3BC 


A3C0


A3C4 


A3C8 


A3CC 


A3D0


A3D4 


A3D8 


A3DC 


A3E0 


A3E4 


A3E8 


A3EC 


A3F0


A3F4 


A3F8 


A3FC 


A400 


A800 


A804 


A808 


A840 


A9D0


AB00


AB04


AB20


AC00


AC04


AC08


AC0C


A3B4 


A3B8 


A3BC 


A3C0


A3C4 


A3C8 


A3CC 


A3D0


A3D4 


A3D8 


A3DC 


A3E0 


A3E4 


A3E8 


A3EC 


A3F0


A3F4 


A3F8 


A3FC 


A400 


A800 


A804 


A808 


A908 


AA08 


AB00


AB04 


AB5C 


AC00


AC04


AC08


AC0C


41908 


41912 


41916 


41920 


41924 


41928 


41932 


41936 


41940 


41944 


41948 


41952 


41956 


41960 


41964 


41968 


41972 


41976 


41980 


41984 


43008 


43012 


43016 


43072 


43472 


43776 


43780 


43808 


44032 


44036 


44040 


44044 


28ED 


28EE 


28EF 


28F0 


28F1 


28F2 


28F3 


28F4 


28F5 


28F6 


28F7 


28F8 


28F9 


28FA 


28FB 


28FC 


28FD 


28FE 


28FF 


2900 


2A00 


2A01 


2A02 


2A10 


2A74 


2AC0


2AC1 


2AC8 


2B00 


2B01 


2B02 


2B03 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


64 


64 


32 


32 


32 


32 


32 


32 


64 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


26 


8 


1 


1 


16 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging No 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 
RTL CSR G 
RPB 


RTL CSR G 
RPB 

RTL CSR G 
RPB 

RTL STICKY 
B 

RTL CSR_G 
RPB 


RTL CSR_G 
RPB 

RTL CSR_G 
RPB 

RTL CSR_G 
RPB 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 

PSMI register 
Master Lock register 
APIC Identification 
Register 

APIC Version Register 
APIC Priority Register 


APIC Redirection Table 


APIC Interrupt Command 
Register 0 to 7 


MCA Interrupt Status 
Register 

MCA Interrupt Enable 
Register 

Scratch Registers for 
Software 

PSMI shadow memory 
size bits 


PSMI Enable bits 
PSMI Time Control 


Concatenated core 
halted status 
FORCE_BPM  Forces the BPM output pins for instrumentation purposes

CORE_HALTED  Paging  Core of same number writes a 1 just before halting

AGF_CONTROL  Paging  This register contains the AGF control fields

AGF_PERIOD_N  Paging  This register contains the 8 possible AGF period settings used for dynamic frequency changes. See also AGF_MASTER_MCLK_RANGE, AGF_MASTER_DELAY_OUT, AGF_MASTER_DELAY_IN and AGF_MCLK_RANGE

AGF_MASTER_DELAY_IN  AFF4  AFF4  45044  2BFD  32  Ring 0  Paging  No  Yes  RTL  CSR_GRPB  CRU
    This register contains the 8 possible AGF master input delay settings used for dynamic frequency changes. See also AGF_MASTER_MCLK_RANGE, AGF_MASTER_DELAY_OUT, AGF_MASTER_DELAY_IN and AGF_MCLK_RANGE

AGF_MASTER_DELAY_OUT  AFF8  AFF8  45048  2BFE  32  Ring 0  Paging  No  Yes  RTL  CSR_GRPB  CRU
    This register contains the 8 possible AGF master output delay settings used for dynamic frequency changes. See also AGF_MASTER_MCLK_RANGE, AGF_MASTER_DELAY_OUT, AGF_MASTER_DELAY_IN and AGF_MCLK_RANGE

AGF_MASTER_MCLK_RANGE  AFFC  AFFC  45052  2BFF  32  Ring 0  Paging  No  Yes  RTL  CSR_GRPB  CRU
    This register contains the 8 range selects which are used to determine the parameters used by the AGF. The MCLK frequency is communicated on 6 bits. The ranges are cumulative. For example, the range for group 2 is RANGE_0+RANGE_1 to RANGE_0+RANGE_1+RANGE_2. See also AGF_MASTER_MCLK_RANGE, AGF_MASTER_DELAY_OUT, AGF_MASTER_DELAY_IN and AGF_MCLK_RANGE
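
The cumulative-range rule in the AGF_MASTER_MCLK_RANGE description can be illustrated with a short Python sketch. It treats each RANGE_n value as a width and derives the lower and upper bound of every group by running sums, so group 2 spans RANGE_0+RANGE_1 to RANGE_0+RANGE_1+RANGE_2 as the description says. The function name and the sample widths are hypothetical, group 0 is assumed to start at 0, and the table does not say whether the bounds are inclusive or how the fields are packed in the register.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def mclk_group_bounds(ranges: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (low, high) for each group, given per-group range widths.

    The upper bound of group n is the sum of widths 0..n; the lower bound
    is the upper bound of the previous group (0 for group 0, an assumption).
    """
    highs = list(accumulate(ranges))
    lows = [0] + highs[:-1]
    return list(zip(lows, highs))


if __name__ == "__main__":
    # Made-up widths for three groups, purely for illustration.
    for group, (low, high) in enumerate(mclk_group_bounds([4, 6, 10])):
        print(f"group {group}: {low} to {high}")
```

With the sample widths 4, 6 and 10, group 2 runs from 10 (4+6) to 20 (4+6+10).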

RDMASR0  B180  B180  45440  2C60  32  1  Ring 0  Paging  Yes  Yes  RTL  CRU  Remote DMA register
RDMASR1  B184  B184  45444  2C61  32  1  Ring 0  Paging  Yes  Yes  RTL  CRU  Remote DMA register
RDMASR2  B188  B188  45448  2C62  32  1  Ring 0  Paging  Yes  Yes  RTL  CRU  Remote DMA register
RDMASR3  B18C  B18C  45452  2C63  32  1  Ring 0  Paging  Yes  Yes  RTL  CRU  Remote DMA register
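
Each complete row gives the same location in several forms. In the RDMASR rows, hexadecimal B180 equals decimal 45440, and 2C60 is that byte offset divided by four, which suggests the fifth column is the offset counted in 32-bit (dword) units. The Python sketch below encodes that relationship so damaged entries elsewhere in the listing can be cross-checked. The function name is hypothetical and the column interpretation is an inference from the rows, not something the table states.

```python
def dword_offset(byte_offset: int) -> int:
    """Convert a byte offset to an offset in 32-bit (dword) units.

    Assumes registers are 4-byte aligned, as every complete row in the
    listing appears to be.
    """
    if byte_offset < 0 or byte_offset % 4 != 0:
        raise ValueError("byte offset must be non-negative and 4-byte aligned")
    return byte_offset // 4


if __name__ == "__main__":
    # RDMASR0: hex B180, decimal 45440, dword offset hex 2C60.
    print(0xB180, hex(dword_offset(0xB180)))
```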
RDMASR4
RDMASR5
RDMASR6
RDMASR7


C6_SCRATCH


APR_PHY_BASE
RS_CR_ADAK_CTL

RS_CR_BL_CTL
SBQ_LOCK


BSPID


UNSTALL_DELAY


SBOX_RS_EMON_Selectors


SBOX_EMON_CNT_OVFL


EMON_CNT0
EMON_CNT1
EMON_CNT2
EMON_CNT3


SBQ_MISC


SPARE1


SPARE2


SPARE3


DBOX_BW_RESERVATION


B190 


B194 


B198 


B19C 


C11C 


CC00


CC04


CC0C


CC10 


CC14 


CC20 


CC24 


CC28 


CC2C 


CC30 


CC34 


CC38 


CC3C 


CC40 


CC50 


CC54 


CC58 


CC5C 


CC60 


CC64 


CC68 


B190 


B194 


B198 


B19C 


C054 


C11C 


CC00


CC04


CC0C


CC10 


CC14 


CC20 


CC24 


CC28 


CC2C 


CC30 


CC34 


CC38 


CC3C 


CC40 


CC44 


CC50 


CC54 


CC58 


CC5C 


CC60 


CC64 


CC68 


45456 


45460 


45464 


45468 


49152 


49436 


52224 


52228 


52236 


52240 


52244 


52256 


52260 


52264 


52268 


52272 


52276 


52280 


52284 


52288 


52292 


52304 


52308 


52312 


52316 


52320 


52324 


52328 


2C64 


2C65 


2C66 


2C67 


3000 


3047 


3300 


3301 


3303 


3304 


3305 


3308 


3309 


330A 


330B 


330C 


330D 


330E 


330F 


3310 


3311 


3314 


3315 


3316 


3317 


3318 


3319 


331A 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 


1 


1 


1 


22 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


1 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Ring 0 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Paging No 


Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes OTHER CSR_GRPB
Yes CSR_GRPB
Yes CSR_GRPB
Yes CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes RTL CSR_GRPB
Yes FUSE, FLASH CSR_GRPB
Yes FUSE, FLASH CSR_GRPB
Yes FUSE, FLASH CSR_GRPB
Yes FUSE, FLASH CSR_GRPB
Yes FUSE, FLASH CSR_GRPB
Yes CSR_GRPB


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


Remote DMA register 
Remote DMA register 
Remote DMA register 
Remote DMA register 


Scratch Pad registers for 
package-C6 


Ring Stop ADAK Control 
Register 

Ring Stop BL Control 
Register 

Master Lock register 


BSP ID Register 


Number of clocks of 
delay to be inserted after 
all fuse packets sent 
SBOX RS EMON selectors 


This indicates if there's 
any overflow in any 
EMON counter 


EMON counter 0 
EMON counter 1 
EMON counter 2 
EMON counter 3 


Misc register with sbq 
chicken bits, etc. 


Spare register full 32-bits 
available 


Spare register full 32-bits 
available 


Spare register full 32-bits 
available 


8-bits DBOX reservation 
slot value from SW 


Ring Stop Agent 
Configuration Register 0


Ring Stop Agent 
Configuration Register 1 


Ring Stop Agent 
Configuration Register 2 


Ring Stop Agent 
Configuration Register 3 


Ring Stop Agent 
Configuration Register 4 


Ring Stop Agent 
Configuration Register 5 
RESET_FSM


EMON_TCU_CONTROL
Doorbell_INT


MarkerMessage_Disable
MarkerMessage_Assert
MarkerMessage_Send


SDAT


DATOUT


DATIN0


CC78 


CC84 


CC90 


CCA0


CCA4 


CCA8 


CCE8 


CCF0


CCF8 


CC78 


CC84 


CC9C 


CCA0


CCA4 


CCA8 


CCE0


CCE8 


CCF0


CCF8 


52344 


52356 


52368 


52384 


52388 


52392 


52448 


52456 


52464 


52472 


331E 


3321 


3324 


3328 


3329 


332A 


3338 


333A 


333C 


333E 


32 


32 


32 


32 


32 


32 


32 


32 


32 


32 


1 Ring 0


1 Ring 0


4 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


1 Ring 0


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging Yes 


Paging No 


Yes 


Yes 


Yes 


Yes 


Yes 


Yes 


Lockable


RTL CSR_GRPB
RTL CSR_G


RTL


RTL


Paging No Lockable RTL CSR_GRPB


Paging No Lockable RTL CSR_GRPB


Paging No Lockable RTL CSR_GRPB


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


CRU 


Reset FSM's status 
Register 
TCU EMON Control 


System Doorbell 
Interrupt Command 
Register 0 to 3 

32-bits to disable 
interrupts 

32-bits to assert 
interrupts 

32-bits to log INTSCR 
field of Marker Message 


Primary DAT register for 
LDAT logic 


Secondary DAT register 


Spare register full 32-bits 
available 


DATIN register 